Hard to reproduce the results of GLUE benchmark #5

Open · Harry-zzh opened this issue Aug 10, 2022 · 15 comments

@Harry-zzh

Thanks for your excellent work.
I tried to do grid search on the settings that you described in your paper and codes, but it is still hard for me to reproduce the results of GLUE benchmark. My experiment results on both the dev set and test set are about 3% lower than yours.
I would be very grateful if you could offer exact experiment settings on each dataset, or codes that can reproduce the results of GLUE benchmark.
Looking forward to your reply, thank you !

@LittleMouseInCoding

We have the same reproduction problem and cannot get results as good as those published in the paper. We would really appreciate it if you could share the experiment settings and clean code.
Thanks a lot!

@JetRunner
Owner

Hi @Harry-zzh, thanks for your interest in our work! Just to confirm: 3% lower than reported means 3 absolute points, right? That would put it below all the baselines in Table 1, even vanilla KD.

@MichaelZhouwang and I will take a closer look. It would be great if you could share the exact command you used.

@MichaelZhouwang
Contributor

Hi @Harry-zzh. First, could you share which datasets you ran your experiments on? On small datasets, a 3% variation may indeed come from different random seeds. Otherwise, could you share the exact command that gave your best result on the task?

Also, you may want to check the following points:

  1. Is your teacher achieving results similar to those presented in the paper? And can you reproduce the KD and PKD results reported in our paper on the same dataset? If not, your pre-trained teacher or your basic setup for BERT-KD/PKD is probably not correct.
  2. Are you initializing the student with the fine-tuned teacher parameters? This is done by setting --student_model to the same checkpoint as the teacher model.
  3. For small to medium-sized datasets, you should adapt --logging_rounds so that the model is evaluated at least 5 to 10 times per epoch in order to select the best-performing checkpoint (see the small calculation after this list). This is especially important for MetaDistil, because the peak performance often does not appear at the end of training.
  4. To get better performance, you may need to further tune hyperparameters such as warmup_steps, temperature, the KD loss weight, and weight_decay.
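
As a rough illustration of point 3 (the dataset size, batch size, and evaluation frequency below are just example numbers, not settings we verified):

```bash
# Pick --logging_rounds so the model is evaluated ~5-10 times per epoch.
TRAIN_EXAMPLES=3668      # e.g. MRPC training set size (replace per task)
BATCH_SIZE=32            # effective batch size (replace per setup)
EVALS_PER_EPOCH=8
STEPS_PER_EPOCH=$(( TRAIN_EXAMPLES / BATCH_SIZE ))        # ~114 for MRPC
LOGGING_ROUNDS=$(( STEPS_PER_EPOCH / EVALS_PER_EPOCH ))   # ~14
echo "set --logging_rounds to roughly $LOGGING_ROUNDS"
```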

@Harry-zzh
Author

Thank you for your reply, @JetRunner @MichaelZhouwang.

  1. My results on the GLUE test set are as follows:
| Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla KD (mine) | 86.2/80.3 | 64.7 | 91.7 | 83.4/81.9 | 80.4/79.8 | 87.5 | 69.7/88.6 |
| Vanilla KD [1] | 86.2/80.6 | 64.7 | 91.5 | / | 80.2/79.8 | 88.3 | 70.1/88.8 |
| Meta Distill (mine) | 85.2/79.5 | 65.6 | 91.4 | 83.1/81.4 | 80.8/80.0 | 87.4 | 70.1/88.5 |
| Meta Distill [2] | 88.7/84.7 | 67.2 | 93.5 | 86.1/85.0 | 83.8/83.2 | 90.2 | 71.1/88.9 |

As you can see, almost all of my test-set results are about 3% lower than your reported results. I can reproduce the KD results listed in [1], but your KD results are significantly higher than theirs, and those I cannot reproduce.

| Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base (mine) | 89.0/85.2 | 69.5 | 93.2 | 87.2/85.9 | 84.3/83.9 | 91.1 | 71.5/89.2 |
| BERT-Base [2] | 88.9/84.8 | 66.4 | 93.5 | 87.1/85.8 | 84.6/83.4 | 90.5 | 71.2/89.2 |

My teacher achieves even better performance than your reported results.

  2. I initialize the student with the fine-tuned teacher parameters.
  3. For small and medium datasets such as MRPC, RTE, and SST-2, I set the logging steps to 20; for larger datasets, I set them to 500 or 1000.
  4. I have tried almost all hyper-parameter combinations and still fail to reach a reasonable result. I perform grid search over the student learning rate {1e-5, 2e-5, 3e-5}, the teacher learning rate {2e-6, 5e-6, 1e-5}, the KD loss weight {0.4, 0.5, 0.6}, the seed {12, 42, 2022}, the temperature {2, 5}, the warmup_steps {100, 200}, and the gradient accumulation steps {1, 2} (see the sketch after this list). Since Meta Distill is computationally expensive, I set the batch size to 32.
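
For reference, my grid search is roughly the following shell loop (a sketch only: the task, teacher path, and fixed arguments are placeholders mirroring the MNLI command I share in a later comment, and the warmup_steps and gradient-accumulation loops are omitted for brevity):

```bash
# Grid search over the hyper-parameter sets listed above.
TASK=RTE                                  # placeholder small task
TEACHER=nlp/bert-base-finetuned/rte       # placeholder fine-tuned teacher
for slr in 1e-5 2e-5 3e-5; do             # student learning rate
  for tlr in 2e-6 5e-6 1e-5; do           # teacher learning rate
    for alpha in 0.4 0.5 0.6; do          # KD loss weight
      for seed in 12 42 2022; do
        for temp in 2 5; do
          python nlp/run_glue_distillation_meta.py \
            --model_type bert --task_name $TASK --data_dir nlp/glue_data/$TASK \
            --teacher_model $TEACHER --student_model $TEACHER \
            --num_hidden_layers 6 --do_train --do_eval --do_lower_case \
            --learning_rate $slr --teacher_learning_rate $tlr \
            --alpha $alpha --seed $seed --temperature $temp \
            --per_gpu_train_batch_size 32 --max_seq_length 128 \
            --num_train_epochs 5 --logging_rounds 20 \
            --output_dir output/${TASK}_${slr}_${tlr}_${alpha}_${seed}_${temp}
        done
      done
    done
  done
done
```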

References:
[1] Sun S, Cheng Y, Gan Z, et al. Patient Knowledge Distillation for BERT Model Compression[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 4323-4332.
[2] Zhou W, Xu C, McAuley J. BERT learns to teach: Knowledge distillation with meta learning[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 7037-7049.

@JetRunner
Owner

@Harry-zzh Thanks for the info. Are these results on the test set (i.e., from the GLUE server) or on the validation set? If they are test-set results, could you please also provide the results on the development set?

@MichaelZhouwang could you give it a look?

@JetRunner
Owner

JetRunner commented Aug 16, 2022

By the way, in the NLP experiments, the students in our implementations of both KD and our approach are initialized with pretrained BERT (a well-read student) rather than with the fine-tuned teacher. That is probably why the vanilla KD numbers we report are significantly higher (see the caption under Table 1).
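
For illustration, the difference is only in what --student_model points to; this is just a sketch, the well-read 6-layer checkpoint path (and the other paths) are placeholders, and the remaining arguments follow a normal run:

```bash
# Our NLP setup: the student starts from a truncated pretrained BERT
# ("well-read student"), not from the fine-tuned teacher.
STUDENT_INIT=path/to/well-read-bert-6L            # placeholder pretrained checkpoint
# STUDENT_INIT=path/to/bert-base-finetuned/mnli   # alternative: fine-tuned teacher
python nlp/run_glue_distillation_meta.py \
  --model_type bert --task_name MNLI --data_dir nlp/glue_data/MNLI \
  --teacher_model path/to/bert-base-finetuned/mnli \
  --student_model $STUDENT_INIT \
  --num_hidden_layers 6 --do_train --do_eval --do_lower_case \
  --learning_rate 2e-05 --per_gpu_train_batch_size 32 \
  --max_seq_length 128 --num_train_epochs 5 --output_dir output/mnli
```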

@MichaelZhouwang
Contributor

@Harry-zzh Can you share the exact command for your best result on the task? Also, can you share the results on the dev set of the GLUE benchmark? You can first focus on reproducing the results on the dev set.

@Harry-zzh
Author

Thanks for your reply. I showed the test-set results above; the dev-set results are as follows:

| Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla KD (ours) | 89.6/84.8 | 68.6 | 91.7 | 88.6/88.5 | 80.9 | 87.7 | 86.6/90.1 |
| Meta Distill (ours) | 89.4/84.3 | 69.3 | 91.3 | 88.3/88.0 | 81.3 | 87.9 | 87.2/90.4 |
| BERT-Base (ours) | 91.6/88.2 | 73.3 | 93.1 | 89.8/89.4 | 85.1 | 91.6 | 88.0/91.1 |

I ran grid search over the hyper-parameter sets I described before, and I chose the best checkpoint on the dev set to make predictions on the test set. An example of my command for the MNLI dataset is:
    python nlp/run_glue_distillation_meta.py \
      --model_type bert \
      --teacher_model nlp/bert-base-finetuned/mnli \
      --student_model nlp/bert-base-finetuned/mnli \
      --num_hidden_layers 6 \
      --alpha 0.5 \
      --task_name MNLI \
      --do_train --do_eval \
      --beta 0 \
      --do_lower_case \
      --data_dir nlp/glue_data/MNLI \
      --assume_s_step_size 2e-05 \
      --per_gpu_train_batch_size 32 \
      --per_gpu_eval_batch_size 32 \
      --learning_rate 2e-05 \
      --teacher_learning_rate 2e-06 \
      --max_seq_length 128 \
      --num_train_epochs 5 \
      --output_dir output/mnli \
      --warmup_steps 200 \
      --gradient_accumulation_steps 2 \
      --temperature 5 \
      --seed 42 \
      --logging_rounds 1000 \
      --save_steps 1000

Also, @JetRunner said the student in your approach is initialized with pretrained BERT (a well-read student), while @MichaelZhouwang said it should be initialized with the fine-tuned teacher. I am a bit confused.

Looking forward to your reply. I would be grateful if you could share the exact experiment settings for each dataset.

@MichaelZhouwang
Contributor

Hi, that was my mistake. The student is initialized with pretrained BERT (a well-read student). However, initializing from the fine-tuned teacher should achieve similar performance.

First, I think you should change --num_held_batches from 0 to something like 1, 2, or 4, which introduces randomness into teacher training and speeds up training. Also, for MNLI I think you should use more warmup steps (e.g., 1000/1500/2000), a larger alpha (e.g., 0.6/0.7), a lower temperature (e.g., 2/3), a smaller logging_rounds (e.g., 200), and a larger teacher learning rate (e.g., 5e-6). You may also need to add some regularization such as weight decay. You should be able to reach something above 83.5 on MNLI without much difficulty.

For smaller datasets such as MRPC, an effective batch size of 64 is certainly too large. Also, you should carefully tune the warmup steps together with num_train_epochs.

Unfortunately, we cannot offer the exact experiment settings for each dataset because we no longer have them. Nevertheless, the notes above are the tips we can offer.
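
To make these suggestions concrete, here is a rough sketch of how the MNLI command above could be adjusted; the specific values are illustrative picks from the ranges mentioned, not verified settings, and the --weight_decay flag name is an assumption:

```bash
# Changes vs. the MNLI command above: teacher LR 2e-6 -> 5e-6, held-out
# batches 0 -> 2, warmup 200 -> 1500, alpha 0.5 -> 0.7, temperature 5 -> 2,
# evaluation every 200 steps, plus weight decay (flag name assumed).
python nlp/run_glue_distillation_meta.py \
  --model_type bert --task_name MNLI --data_dir nlp/glue_data/MNLI \
  --teacher_model nlp/bert-base-finetuned/mnli \
  --student_model nlp/bert-base-finetuned/mnli \
  --num_hidden_layers 6 --do_train --do_eval --do_lower_case \
  --assume_s_step_size 2e-05 --beta 0 \
  --per_gpu_train_batch_size 32 --per_gpu_eval_batch_size 32 \
  --gradient_accumulation_steps 2 \
  --max_seq_length 128 --num_train_epochs 5 \
  --learning_rate 2e-05 --teacher_learning_rate 5e-06 \
  --num_held_batches 2 --warmup_steps 1500 \
  --alpha 0.7 --temperature 2 --weight_decay 0.01 \
  --logging_rounds 200 --save_steps 200 \
  --seed 42 --output_dir output/mnli_tuned
```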

@MichaelZhouwang
Contributor

For further questions, you can send me an email with your WeChat ID at wcszhou@outlook.com so that I can offer further guidance more promptly and conveniently.

@Harry-zzh
Author

Thanks, I will give it a try.

@Hakeyi

Hakeyi commented Nov 2, 2022

@Harry-zzh Hi, were you able to reproduce the results reported in the paper?

@amaraAI

amaraAI commented Jan 3, 2023

@Harry-zzh @Hakeyi Hi! Were you able to reproduce the results? If yes, is it possible to share your findings? Thanks a lot!

@Harry-zzh
Author

> @Harry-zzh @Hakeyi Hi! Were you able to reproduce the results? If yes, is it possible to share your findings? Thanks a lot!

Sorry for the late reply. I was not able to reproduce the results.

@Harry-zzh
Author

> @Harry-zzh Hi, were you able to reproduce the results reported in the paper?

Sorry for the late reply. I was not able to reproduce the results.
