A common interface for APIs and Models. #161
Comments
Hi, thanks for the suggestion. I think some challenges in evaluating these models are that they might change and evolve behind the API, which makes the evaluation numbers less relevant over time. They also might require different post-processing to extract the code snippet, since they tend to generate natural text before and after the code, so I'm not sure the current approach we have will work out of the box for most tasks. However, if you do tests and find your implementation to work/match public numbers for certain tasks like … Regarding your indentation issue, I think the prompt is stripped by default and doesn't have a …
I tried some of the things mentioned above, but everything was solved just by giving a simple prompt. Does that make it a valid solution? For example, for HumanEval, the problem was solved when I added this prompt: …
And the model I considered was …
Maybe check this code that the OctoCoder authors submitted for evaluating OpenAI models on HumanEvalSynthesize: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack_openai.py
A very interesting and weird thing: I used …, but just adding … One reason for this could be that between the time CodeLlama was evaluated and the time I am doing the evaluation, gpt-3.5 evolved. And I am not sure whether they are contaminated with the same examples or not.
@Anindyadeep Can you open source your project?
Yeah, we will do that shortly :)
@loubnabnl I had not checked out … So, can you tell me whether the interface sketched below is okay before I put up the PR? Feel free to suggest changes if any.
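Roughly, a wrapper along these lines (a minimal sketch only: the class and method names are placeholders, nothing here exists in the harness today, and it assumes the pre-1.0 `openai` client with `OPENAI_API_KEY` set in the environment):

```python
from abc import ABC, abstractmethod
from typing import List


class ApiModel(ABC):
    """Thin wrapper so API-backed models expose the same generate() call
    the harness uses for Hugging Face models (names are placeholders)."""

    @abstractmethod
    def generate(self, prompts: List[str], **gen_kwargs) -> List[str]:
        """Return one raw completion string per prompt."""
        raise NotImplementedError


class OpenAIChatModel(ApiModel):
    def __init__(self, model_name: str = "gpt-3.5-turbo", system_prompt: str = ""):
        self.model_name = model_name
        self.system_prompt = system_prompt

    def generate(self, prompts: List[str], **gen_kwargs) -> List[str]:
        # Assumes the pre-1.0 `openai` client and OPENAI_API_KEY in the env.
        import openai

        completions = []
        for prompt in prompts:
            response = openai.ChatCompletion.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": prompt},
                ],
                **gen_kwargs,
            )
            completions.append(response["choices"][0]["message"]["content"])
        return completions
```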
Yes, feel free to open a PR and add the scores you got.
Hi @loubnabnl, I started a PR. Let me know which benchmarks I need to evaluate through this so that I can add the results too. Thanks!
Summary of the issue
First of all, thanks for the awesome effort in making this code-evaluation package; I highly appreciate it. However, right now it is integrated with just Hugging Face models. It would be awesome if we could evaluate closed-source models the same way, for example with something like the sketch below:
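Purely illustrative; neither the `evaluate` entry point nor the `backend="api"` option exists in the harness today, they only show the kind of interface I have in mind:

```python
# Hypothetical usage, just to illustrate the idea; the `evaluate` entry
# point and the "api" backend below do not exist in the harness today.
from bigcode_eval import evaluate  # placeholder import

# Open-source model, as the harness already supports.
oss_results = evaluate(model="bigcode/starcoder", tasks=["humaneval"], n_samples=1)

# Closed-source model through the same entry point.
api_results = evaluate(
    model="gpt-3.5-turbo", tasks=["humaneval"], n_samples=1, backend="api"
)
```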
So, with the same interface and the post-processing logic of the code-evaluation-harness, we can leverage this to evaluate and compare code-generation performance for open-source and closed-source models.
What is the motivation?
The motivation behind this is that open-source models are great, but researchers and LLM enthusiasts always strive to build models that surpass GPT with fewer parameters and better performance on certain tasks, and a library like this would be really helpful for that.
How can I contribute:
Well, I already have most of this code ready. If you are aligned with the motivation of the issue, I can create the PR. However, the problem I am facing is that the evaluation scores for API-based models are very low. For example, gpt-3.5 is getting a score of 0.006 on the HumanEval benchmark, even though the generations are correct. The problem is the indentation and the post-processing of the generations. For example, one instance of a gpt-3.5-turbo generation looks something like this: … If we look at that generation, the problem is the indentation, which gets the sample marked as wrong during evaluation. I also tried to adapt big-code's post-processing for different tasks, but it was not working, so I would highly appreciate some help there.
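For reference, this is roughly the kind of post-processing I have been trying (a minimal sketch of my own, not the harness's existing logic; the helper name and the re-indentation heuristic are my assumptions):

```python
import re
import textwrap


def postprocess_chat_completion(raw_output: str, prompt: str) -> str:
    """Illustrative sketch only (not the harness's actual logic): turn a
    chat-style answer into a body that can be appended to a HumanEval prompt."""
    # Keep only the first fenced code block if the model wrapped it in prose.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", raw_output, re.DOTALL)
    code = match.group(1) if match else raw_output

    # Chat models usually re-emit the whole function, while HumanEval scoring
    # expects only the body that continues the prompt's open `def`. If the
    # signature line from the prompt appears again, keep what comes after it.
    sig_lines = [line for line in prompt.splitlines() if line.startswith("def ")]
    if sig_lines and sig_lines[-1] in code:
        code = code.split(sig_lines[-1], 1)[1]
    elif not code.startswith((" ", "\t", "\n")):
        # Bare, unindented body: indent it so it sits inside the function.
        code = textwrap.indent(code, "    ")

    return code
```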