
Inference code bug report #585

Open
shenshaowei opened this issue Sep 18, 2024 · 8 comments
Labels
bug Something isn't working

Comments


shenshaowei commented Sep 18, 2024

Running inference with the un-fine-tuned OneKE model, with bits set to 8:

CUDA_VISIBLE_DEVICES=7 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data3/shensw/model/OneKE' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/oneke_test_change.json' \
    --output_file 'results/oneke_test_change_noft_output_6.json' \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 8

The following error is raised in loader.py at the line `model = model.to(model_args.compute_dtype) if model_args.bits >= 8 else model`:

ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

With bits 16 and 32 there is no problem.
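A likely cause, offered as a guess rather than a verified fix: the condition model_args.bits >= 8 also matches 8-bit bitsandbytes models, which cannot be moved with .to(). A minimal sketch of a guard, reusing only the names that appear in the error message:

# Hypothetical guard for loader.py: cast only when the model is NOT quantized.
# bitsandbytes 4-/8-bit models are already placed and cast by from_pretrained,
# so calling .to() on them raises the ValueError shown above.
if model_args.bits in (16, 32):
    model = model.to(model_args.compute_dtype)
# for bits 4 or 8, leave the model exactly as loaded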
After fine-tuning:

CUDA_VISIBLE_DEVICES=5 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data3/shensw/model/OneKE' \
    --checkpoint_dir '/data3/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-continue-bits8/checkpoint-27260' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/oneke_test_change.json' \
    --output_file 'results/oneke_test_change_lora_output_6-1.json' \
    --finetuning_type lora \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 8

With bits 8, inference is surprisingly slow. The model is quantized, so why is it this slow? Setting bits to 16 is much faster. I see the code uses the bitsandbytes library for quantization; is there a bug in the code?
One last question: what is export_model.py for? Does it merge the fine-tuned OneKE model back into the same format as the original OneKE model?

shenshaowei added the bug (Something isn't working) label on Sep 18, 2024
shenshaowei (Author) commented Sep 18, 2024

Hi, after fine-tuning I want to write a Python script for quick inference, based on the code you provided on ModelScope:

import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = 'zjunlp/OneKE'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


# 4-bit quantization config for OneKE
quantization_config=BitsAndBytesConfig(     
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",  
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()


system_prompt = '<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n'
sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
sintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'

input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(device)
input_length = input_ids.size(1)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)
generation_output = generation_output.sequences[0]
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)

After fine-tuning, how do I combine the checkpoint and the model into a folder that can be used just like model_path = 'zjunlp/OneKE'?

guihonghao (Contributor) commented:

export_model.py merges the base model's weights with the LoRA parameters into a new, standalone model.
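For reference, a minimal sketch of what such a merge typically looks like with the peft library; the paths are placeholders taken from the commands above, and this is not the actual export_model.py implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = '/data3/shensw/model/OneKE'   # base OneKE weights
lora_path = '/data3/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-continue-bits8/checkpoint-27260'
out_path = '/data3/shensw/model/OneKE-merged'   # hypothetical output folder

# Load the base model unquantized; merging requires full-precision weights.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)

# Attach the LoRA adapter and fold its weights into the base model.
model = PeftModel.from_pretrained(base, lora_path)
model = model.merge_and_unload()

# Save in the same layout as the original OneKE folder.
model.save_pretrained(out_path)
tokenizer.save_pretrained(out_path)

merge_and_unload() folds the low-rank updates into the base weights, so the result is a single standard Hugging Face checkpoint.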

shenshaowei (Author) commented:

Once merged, can the model be used with the quick-inference script? In my tests, the quick-inference script is noticeably slower than inference.py, and on the same test data the outputs from inference.py do not match the outputs of the merged model with the simple inference script. How can I resolve this?

guihonghao (Contributor) commented:

Check whether 4-bit quantization is enabled. If the quick-inference script quantizes the model, inference will be slower.

shenshaowei (Author) commented Sep 18, 2024

I fine-tuned with bits set to 8. After training, inference with bits set to 16 is much faster than with 8 or 4, yet the accuracy is essentially the same. Shouldn't quantization normally be faster? In the quick-inference code, the following configuration is used:

quantization_config=BitsAndBytesConfig(     
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

Would it be faster to omit it? Also, if LoRA fine-tuning is done without quantization, should bits be set to 16 or 32? One training epoch takes too long, so I'd appreciate your advice.

shenshaowei (Author) commented:

Is post-training quantization supported? Is there code for it?

guihonghao (Contributor) commented:

(Quoting shenshaowei's question above about quantization speed and the bits setting.)

Quantization makes inference slower. For LoRA fine-tuning without quantization, set bits to 16 or 32.
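To illustrate this answer: if speed matters more than GPU memory, loading the merged model without a BitsAndBytesConfig (plain bf16) is usually faster on GPUs with native bf16 support, since no bitsandbytes dequantization kernels are involved. A minimal sketch, assuming a hypothetical merged-model folder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = '/data3/shensw/model/OneKE-merged'   # placeholder: merged model folder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# No quantization_config: weights stay in bf16, trading higher memory use for faster matmuls.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()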

guihonghao (Contributor) commented:

Whether training or inference uses quantization is controlled solely by the --bits argument: 16 and 32 mean no quantization, while 8 and 4 enable quantization.
