Stop GSM8k generation at double new line #147

Merged (2 commits) on Apr 27, 2024

Conversation

@OyvindTafjord (Contributor) commented Apr 12, 2024:

This loosens the stopping criterion for GSM8k generation to allow a single newline, which is sometimes generated at the start of an output, while still stopping at a double newline (consistent with the prompt format).

It's a bit hacky, but should work for tokenizers where "\n\n" is either one or two tokens. Often "\n\n" is indeed a single token, so this should actually be more effective than the previous "\n" stopping condition if the model actually does generate a double newline. FYI @dwadden
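
For illustration, here is a minimal sketch (not the actual diff in this PR) of one way to stop at a double newline that works whether the tokenizer encodes "\n\n" as one token or two, by checking the decoded continuation rather than raw token ids; the class and argument names below are made up:

    import torch
    from transformers import StoppingCriteria

    class StopOnDoubleNewline(StoppingCriteria):
        """Stop once the generated continuation contains a double newline."""

        def __init__(self, tokenizer, prompt_length):
            self.tokenizer = tokenizer
            self.prompt_length = prompt_length  # number of prompt tokens to skip

        def __call__(self, input_ids, scores, **kwargs):
            # Only decode the continuation, so a single leading "\n" in the output
            # (or newlines in the prompt itself) does not trigger an early stop.
            # For simplicity this inspects only the first sequence in the batch.
            generated = self.tokenizer.decode(input_ids[0, self.prompt_length:])
            return "\n\n" in generated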

@hamishivi (Collaborator):

I'm a little hesitant to merge this without checking if it changes our old evaluations. I wonder if a good solution would be to make the stop sequence configurable to avoid this issue. @yizhongw thoughts?

@OyvindTafjord (Contributor, Author):

Yeah, I agree the backwards compatibility is potentially awkward. We could put it behind a flag that preserves the earlier behavior. I'm not sure how strict you want to be about not changing earlier evals. E.g., I noticed a typo on line 56 ("Quesion: " + example["question"]); is that something you also wouldn't want to change now?

@hamishivi (Collaborator):

Yeah, I think this is an awkward thing in general. I don't want to make changes while papers are in motion, to avoid having to re-compute old evals, but it's also worth tracking and updating these things, while trying to avoid accumulating a large number of flags. Really, coming up with some sort of 'release process'/versioning setup for this would be nice...

@dwadden (Contributor) commented Apr 26, 2024:

I need this PR for evals I'm running now. Instead of changing the existing gsm task, could we just add another one that allows newlines?

@OyvindTafjord (Contributor, Author):

I added an option so that this doesn't change existing behavior unless --stop_at_double_newline is passed to the run_eval call. E.g., in submit_eval_jobs.py you would do this to get the new behavior:

    elif experiment_group == "gsm_cot":
        task_spec['arguments'][0] = '''
            python -m eval.gsm.run_eval \
            --data_dir /data/gsm/ \
            --max_num_examples 200 \
            --save_dir /output/ \
            --use_vllm \
            --model_name_or_path /model \
            --tokenizer_name_or_path /model \
            --n_shot 8 \
            --use_chat_format \
            --chat_formatting_function eval.templates.create_prompt_with_tulu_chat_format \
            --stop_at_double_newline
        ''' 
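
For reference, a rough sketch of how such a flag might map onto vLLM sampling parameters inside the eval script (the helper function and its defaults below are illustrative, not the actual run_eval.py code):

    from vllm import SamplingParams

    def make_sampling_params(stop_at_double_newline: bool) -> SamplingParams:
        # Stop at a double newline when the flag is set; otherwise keep the
        # original single-newline stopping behavior.
        stop = ["\n\n"] if stop_at_double_newline else ["\n"]
        return SamplingParams(temperature=0.0, max_tokens=512, stop=stop)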

@hamishivi (Collaborator):

Thanks, that's fine! Probably not an ideal solution for all scenarios, but it seems like a decent stopgap for now.

@hamishivi merged commit f342459 into allenai:main on Apr 27, 2024 (1 check passed)
@dwadden (Contributor) commented Apr 27, 2024:

Perfect, thanks! I'll rerun evals with this.
