
Wombat-7B, Wombat-7B-gpt4, and ChatGPT comparison on the Vicuna test set, evaluated by GPT-4 #18

Open
onlyfish79 opened this issue Apr 24, 2023 · 4 comments

Comments

@onlyfish79

1. Wombat-7B vs. ChatGPT, compared on the Vicuna test set and scored by GPT-4:
   Wombat-7B: 599.0 (average score: 7.5)
   ChatGPT: 710.5 (average score: 8.9)
   wombat-7b / gpt35 = 84.31%
2. Wombat-7B-gpt4 vs. ChatGPT, compared on the Vicuna test set and scored by GPT-4:
   Wombat-7B-gpt4: 577.0 (average score: 7.2)
   ChatGPT: 734.5 (average score: 9.2)
   wombat-7b-gpt4 / gpt35 = 78.13%

Wombat-7B and Wombat-7B-gpt4 were both recovered with the script recover_wombat_7b.sh.
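
For clarity, the percentages above are ratios of total GPT-4 scores (e.g. 599.0 / 710.5 ≈ 84.31%), and the averages are totals divided by the 80 Vicuna test questions. A minimal Python sketch of the computation (the score lists here are hypothetical):

```python
# Hypothetical per-question GPT-4 scores (the real lists have 80 entries,
# one per Vicuna test question).
wombat_scores = [7.0, 8.0, 7.5]
chatgpt_scores = [9.0, 8.5, 9.0]

total_wombat = sum(wombat_scores)    # corresponds to 599.0 above
total_chatgpt = sum(chatgpt_scores)  # corresponds to 710.5 above

# The reported percentage is the ratio of total scores,
# and the average is the total divided by the number of questions.
print(f"ratio: {total_wombat / total_chatgpt:.2%}")
print(f"avg wombat: {total_wombat / len(wombat_scores):.1f}")
print(f"avg chatgpt: {total_chatgpt / len(chatgpt_scores):.1f}")
```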

According to the above results, Wombat-7B performs better than Wombat-7B-gpt4. Does this result meet your expectations?

@GanjinZero
Owner

Yes, it does meet our expectations, and we observe a similar score for Wombat-7B-gpt4 vs. ChatGPT.
The reason is that Wombat-7B uses 5 responses per query to train RRHF.
Although Wombat-7B-gpt4 uses better responses, it only has 2 responses per query.
We think more diverse responses are the most important factor when training RRHF.
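
For readers unfamiliar with RRHF: it ranks the k sampled responses for each query by reward and trains the policy so that higher-reward responses receive higher length-normalized log-probability, plus a fine-tuning loss on the best response. A minimal PyTorch sketch of that objective (function name and shapes are illustrative, not the repo's actual API):

```python
import torch

def rrhf_loss(logprobs, lengths, rewards, best_idx):
    """Sketch of the RRHF objective for k responses to one query.

    logprobs: (k,) summed token log-probs of each response under the policy
    lengths:  (k,) token counts, used for length normalization
    rewards:  (k,) reward scores used only to order the responses
    best_idx: index of the highest-reward response
    """
    # Length-normalized conditional log-probability p_i of each response.
    p = logprobs / lengths

    # Ranking loss: penalize every pair where a lower-reward response
    # out-scores a higher-reward one under the policy.
    rank_loss = torch.zeros(())
    k = p.shape[0]
    for i in range(k):
        for j in range(k):
            if rewards[i] < rewards[j]:
                rank_loss = rank_loss + torch.relu(p[i] - p[j])

    # Fine-tuning (cross-entropy) loss on the best response keeps the
    # policy anchored to high-quality text.
    ft_loss = -logprobs[best_idx]

    return rank_loss + ft_loss
```

With k = 5 responses per query (Wombat-7B) the ranking loss sees up to 10 ordered pairs, versus a single pair with k = 2 (Wombat-7B-gpt4), which is one way to read the "diversity matters" point above.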

@GanjinZero
Owner

Another possible factor is that Wombat-7B uses responses sampled from its initial checkpoint, while Wombat-7B-gpt4 does not.
Since RRHF tries to improve the model based on itself, not using responses from the initial checkpoint worsens RRHF's performance.
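
That is, Wombat-7B's response pool includes samples drawn from the initial checkpoint itself. A hedged sketch of generating such samples with Hugging Face transformers (the checkpoint path, prompt, and sampling parameters are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative: sample responses from the initial (pre-RRHF) checkpoint
# so they can be scored and mixed into the training pool.
tokenizer = AutoTokenizer.from_pretrained("path/to/initial-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/initial-checkpoint")

inputs = tokenizer("Explain RRHF in one sentence.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # stochastic sampling for diverse responses
    top_p=0.9,               # nucleus sampling (illustrative value)
    temperature=1.0,
    num_return_sequences=4,  # several candidates per query
    max_new_tokens=128,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```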

@onlyfish79
Author

Understood, thank you for your response.
May I ask about the upcoming roadmap for RRHF?

@GanjinZero
Owner

> Understood, thank you for your response. May I ask about the upcoming roadmap for RRHF?

Chain-of-thought reasoning & scaling to 13B, 30B, 65B LLaMA / Alpaca.
