Fix missing DDP in torch distributed #3185

Merged 6 commits on Dec 7, 2020

hkvision commented Dec 4, 2020

intel-analytics/analytics-zoo#528
Fix for the torch_distributed backend.


metric_meters.update(metrics, n=metrics.pop(NUM_SAMPLES, 1))
self.global_step += 1
with self.model.join():
hkvision commented:

`model.join()` can handle uneven data across workers, but it is only available from torch 1.7.0 onward: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html

hkvision commented:

I will remove this change for now, since our test environment doesn't have torch 1.7.0.
We can add it back later if we decide to support torch 1.7.0.
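
For illustration only, here is a minimal sketch of one way the `join()` call could be kept without requiring torch 1.7.0: fall back to a no-op context when the model object has no `join` method. The helper name `maybe_join` and the two stand-in classes are hypothetical and are not part of this PR. With a real `DistributedDataParallel` model on torch 1.7.0 or later, the first branch would call the model's own `join()`.

```python
import contextlib


def maybe_join(model):
    """Return model.join() when available, otherwise a no-op context manager.

    Hypothetical helper: DistributedDataParallel.join() exists from
    torch 1.7.0, so older versions fall through to nullcontext().
    """
    join = getattr(model, "join", None)
    if callable(join):
        return join()
    return contextlib.nullcontext()


# Stand-ins for a DDP model on a new and an old torch version (illustration only).
class ModelWithJoin:
    def __init__(self):
        self.joined = False

    @contextlib.contextmanager
    def join(self):
        self.joined = True
        yield


class ModelWithoutJoin:
    pass


new_model, old_model = ModelWithJoin(), ModelWithoutJoin()
with maybe_join(new_model):
    pass  # the loop over this worker's batches would go here
with maybe_join(old_model):
    pass  # same loop, but without any uneven-input handling
```

On the fallback path, workers with different numbers of batches are not protected, so this only keeps the code importable on older versions; it does not replace `join()`.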

yangw1234 commented Dec 6, 2020

Is it possible to write a unit test?

E.g., checking whether different workers have the same weights after training.

hkvision commented Dec 7, 2020

> Is it possible to write a unit test?
>
> E.g., checking whether different workers have the same weights after training.

Added. Take a look.
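
A rough sketch of the kind of check discussed here, not the test that was actually added in this PR: compare each worker's parameters against those of worker 0. The function name `assert_same_weights` and the input format (one dict per worker, mapping a parameter name to a flat list of floats, e.g. produced with `tensor.flatten().tolist()`) are assumptions made for this example.

```python
import math


def assert_same_weights(worker_weights, rel_tol=1e-6, abs_tol=1e-8):
    """Check that every worker holds the same parameters as worker 0.

    worker_weights: list with one dict per worker, mapping a parameter
    name to a flat list of floats (hypothetical format for this sketch).
    """
    reference = worker_weights[0]
    for rank, weights in enumerate(worker_weights[1:], start=1):
        assert weights.keys() == reference.keys(), (
            f"worker {rank}: parameter names differ"
        )
        for name, expected in reference.items():
            actual = weights[name]
            assert len(actual) == len(expected), (
                f"worker {rank}: size mismatch in {name}"
            )
            for a, e in zip(actual, expected):
                assert math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol), (
                    f"worker {rank}: {name} differs ({a} vs {e})"
                )


# Two workers whose weights stayed in sync pass the check.
synced = [{"fc.weight": [0.5, -1.25], "fc.bias": [0.1]} for _ in range(2)]
assert_same_weights(synced)
```

If gradients were not being synchronized across workers, their weights would be expected to drift apart during training, and a check like this would raise an `AssertionError`.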

hkvision commented Dec 7, 2020

http://10.239.47.210:18888/view/ZOO-PR/job/ZOO-PR-Validation/4715/ One example test crashed; all other tests passed. I will merge this first.

hkvision merged commit 939dd28 into intel-analytics:master on Dec 7, 2020
hkvision deleted the fix-ddp branch on December 7, 2020 09:09
hkvision added a commit to hkvision/analytics-zoo that referenced this pull request Dec 7, 2020
* fix ddp
* add model join
* fix
* add ut
* remove join
* remove debug msg
hkvision added a commit that referenced this pull request Dec 7, 2020
yangw1234 pushed a commit to yangw1234/analytics-zoo that referenced this pull request Sep 23, 2021
yangw1234 pushed a commit to yangw1234/analytics-zoo that referenced this pull request Sep 23, 2021
yangw1234 pushed a commit to yangw1234/analytics-zoo that referenced this pull request Sep 26, 2021
yangw1234 pushed a commit to yangw1234/analytics-zoo that referenced this pull request Sep 26, 2021
yangw1234 pushed a commit that referenced this pull request Sep 27, 2021
dding3 pushed a commit to dding3/analytics-zoo that referenced this pull request Oct 4, 2021