Validation hangs after first validation sample when using deepspeed stage 3 with multi-gpu #15

Open
AceMcAwesome77 opened this issue Aug 27, 2024 · 1 comment

Comments


AceMcAwesome77 commented Aug 27, 2024

Hi, thanks again for this excellent repo. The model trains fine for me on a multi-GPU system using DeepSpeed stage 2, with or without offload, and I see the expected reduction in training/validation time. However, when I train with DeepSpeed stage 3, the training_step progresses fine, but the model hangs indefinitely right after the validation step starts and one validation sample is completed. By adding logging, I determined that the GPUs generate tokens one by one simultaneously as expected, but as soon as one GPU finishes generating the sequence for its validation sample, all the other GPUs hang. This makes me suspect a synchronization issue across the GPUs: once one GPU has finished the token-by-token generation loop in the forward function used during validation, the remaining GPUs are apparently not allowed to keep stepping through it. However, I don't understand why this would manifest only under stage 3 and not under stage 2.
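
A toy script along the following lines (not from this repo; it only sketches the suspected mechanism, under the assumption that stage 3 requires a collective per forward pass because the parameters are sharded) shows how one rank leaving the generation loop early can leave the other ranks blocked:

```python
# Toy illustration of the suspected failure mode (illustrative only, not code
# from this repo). ZeRO stage 3 shards the parameters, so every forward pass
# implies a collective across all ranks; if one rank stops generating while the
# others keep going, the remaining ranks block on their next collective.
# Run with: torchrun --nproc_per_node=2 toy_hang.py
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")      # gloo so the toy runs on CPU
    rank = dist.get_rank()
    steps = 3 if rank == 0 else 5        # rank 0 "hits EOS" after 3 tokens
    x = torch.ones(1)
    for i in range(steps):
        dist.all_reduce(x)               # stand-in for the per-forward all-gather
        print(f"rank {rank}: step {i} done", flush=True)
    if rank == 0:
        time.sleep(600)                  # rank 0 idles after finishing, like a GPU
                                         # that has produced its full sequence
    print(f"rank {rank}: exiting")       # rank 1 never prints this; it is stuck
                                         # inside its 4th all_reduce

if __name__ == "__main__":
    main()
```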

Has anyone else successfully trained this model, including the validation step, using DeepSpeed stage 3? I've tried adjusting some of the deepspeed_config parameters, but with no success yet. Thanks!
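
For reference, a minimal stage-3 section of a DeepSpeed config might look like the sketch below. The keys are standard DeepSpeed config options, but the values are illustrative assumptions rather than settings known to resolve this hang:

```python
# Minimal ZeRO stage-3 config sketch (standard DeepSpeed keys; values are
# illustrative assumptions, not a known fix for the validation hang).
deepspeed_stage3_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}
```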


AceMcAwesome77 commented Aug 27, 2024

I think this thread may be relevant to the problem:

microsoft/DeepSpeed#860

I thought I could work around this by setting eos_token_id=None in the self.encoder_decoder.generate() call so that all GPUs generate the same number of tokens during validation_step, but then I got a new error along the lines of "disagreement between rank0 and rank4" that I don't know how to fix. I do think stas00 is probably right about the cause in that thread.
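
One possible workaround, assuming self.encoder_decoder is a Hugging Face transformers model whose generate() accepts the synced_gpus flag, is to pass synced_gpus=True: ranks that hit EOS early then keep issuing dummy forward passes until every rank has finished, so the stage-3 collectives stay matched. This is a sketch under that assumption, not a confirmed fix for this repo:

```python
import torch

def generate_synced(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper for validation_step: generate under ZeRO stage 3
    without deadlocking. With synced_gpus=True, ranks that finish early keep
    running dummy forward passes until the slowest rank is done."""
    return model.generate(
        input_ids,
        max_new_tokens=128,   # illustrative limit, not taken from the repo
        synced_gpus=True,
    )
```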
