
Inference code bug report #585

Open
shenshaowei opened this issue Sep 18, 2024 · 8 comments
Labels
bug Something isn't working

Comments


shenshaowei commented Sep 18, 2024

Running inference with the un-fine-tuned OneKE model, with bits set to 8:

CUDA_VISIBLE_DEVICES=7 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data3/shensw/model/OneKE' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/oneke_test_change.json' \
    --output_file 'results/oneke_test_change_noft_output_6.json' \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 8

The following error is raised in loader.py at the line `model = model.to(model_args.compute_dtype) if model_args.bits >= 8 else model`:

ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

With bits 16 and 32 there is no problem.
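A likely cause, offered as a guess rather than a verified fix: the condition model_args.bits >= 8 also matches 8-bit bitsandbytes models, which cannot be moved with .to(). A minimal sketch of a guard, reusing only the names that appear in the error message:

# Hypothetical guard for loader.py: cast only when the model is NOT quantized.
# bitsandbytes 4-/8-bit models are already placed and cast by from_pretrained,
# so calling .to() on them raises the ValueError shown above.
if model_args.bits in (16, 32):
    model = model.to(model_args.compute_dtype)
# for bits 4 or 8, leave the model exactly as loaded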
After fine-tuning:

CUDA_VISIBLE_DEVICES=5 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data3/shensw/model/OneKE' \
    --checkpoint_dir '/data3/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-continue-bits8/checkpoint-27260' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/oneke_test_change.json' \
    --output_file 'results/oneke_test_change_lora_output_6-1.json' \
    --finetuning_type lora \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 8

With bits 8, inference is surprisingly slow. The model is quantized, so why is it this slow? Setting bits to 16 is much faster. I see the code uses the bitsandbytes library for quantization; is there a bug in the code?
One last question: what is export_model.py for? Does it merge the fine-tuned OneKE model back into the same format as the original OneKE model?

shenshaowei added the bug (Something isn't working) label on Sep 18, 2024
shenshaowei (Author) commented Sep 18, 2024

Hi, after fine-tuning I want to write a Python script for quick inference, based on the code you provided on ModelScope:

import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = 'zjunlp/OneKE'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


# 4-bit quantization config for OneKE
quantization_config=BitsAndBytesConfig(     
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",  
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()


system_prompt = '<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n'
sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
sintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'

input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(device)
input_length = input_ids.size(1)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)
generation_output = generation_output.sequences[0]
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)

After fine-tuning, how do I combine the checkpoint and the model into a folder that can be used just like model_path = 'zjunlp/OneKE'?

guihonghao (Contributor) commented:

export_model.py merges the base model's weights with the LoRA parameters into a new, standalone model.
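For reference, a minimal sketch of what such a merge typically looks like with the peft library; the paths are placeholders taken from the commands above, and this is not the actual export_model.py implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = '/data3/shensw/model/OneKE'   # base OneKE weights
lora_path = '/data3/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-continue-bits8/checkpoint-27260'
out_path = '/data3/shensw/model/OneKE-merged'   # hypothetical output folder

# Load the base model unquantized; merging requires full-precision weights.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)

# Attach the LoRA adapter and fold its weights into the base model.
model = PeftModel.from_pretrained(base, lora_path)
model = model.merge_and_unload()

# Save in the same layout as the original OneKE folder.
model.save_pretrained(out_path)
tokenizer.save_pretrained(out_path)

merge_and_unload() folds the low-rank updates into the base weights, so the result is a single standard Hugging Face checkpoint.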

shenshaowei (Author) commented:

Once merged, can the model be used with the quick-inference script? In my tests, the quick-inference script is noticeably slower than inference.py, and on the same test data the outputs from inference.py do not match the outputs of the merged model with the simple inference script. How can I resolve this?

guihonghao (Contributor) commented:

Check whether 4-bit quantization is enabled. If the quick-inference script quantizes the model, inference will be slower.

shenshaowei (Author) commented Sep 18, 2024

I fine-tuned with bits set to 8. After training, inference with bits set to 16 is much faster than with 8 or 4, yet the accuracy is essentially the same. Shouldn't quantization normally be faster? In the quick-inference code, the following configuration is used:

quantization_config=BitsAndBytesConfig(     
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

Would it be faster to omit it? Also, if LoRA fine-tuning is done without quantization, should bits be set to 16 or 32? One training epoch takes too long, so I'd appreciate your advice.

shenshaowei (Author) commented:

Is post-training quantization supported? Is there code for it?

guihonghao (Contributor) commented:

(Quoting shenshaowei's question above about quantization speed and the bits setting.)

Quantization makes inference slower. For LoRA fine-tuning without quantization, set bits to 16 or 32.
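To illustrate this answer: if speed matters more than GPU memory, loading the merged model without a BitsAndBytesConfig (plain bf16) is usually faster on GPUs with native bf16 support, since no bitsandbytes dequantization kernels are involved. A minimal sketch, assuming a hypothetical merged-model folder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = '/data3/shensw/model/OneKE-merged'   # placeholder: merged model folder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# No quantization_config: weights stay in bf16, trading higher memory use for faster matmuls.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()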

guihonghao (Contributor) commented:

Whether training or inference uses quantization is controlled solely by the --bits argument: 16 and 32 mean no quantization, while 8 and 4 enable quantization.
