Pretraining error reporting on Apple M3 Pro device #446

Open
zh-zhang1984 opened this issue Aug 14, 2024 · 0 comments
@zh-zhang1984

When I follow the instructions to pretrain the model, I get the error report below.
I am using an Apple M3 Pro device; how can I solve this issue?

ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
E0814 10:20:55.577000 8228223680 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 6809) of binary: /opt/anaconda3/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-14_10:20:55
  host      : 1.0.0.127.in-addr.arpa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6809)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
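
For context, here is a minimal sketch (with a hypothetical `seed` field; the real definitions live in fairseq.dataclass.configs) of the dataclass pattern that Python 3.11 rejects, next to the `default_factory` form that the error message suggests:

```python
from dataclasses import dataclass, field

@dataclass
class CommonConfig:
    seed: int = 1  # hypothetical field, for illustration only

# Python <= 3.10 accepted a plain dataclass instance as a default;
# Python 3.11 rejects any unhashable default with
# "ValueError: mutable default ... use default_factory":
#
# @dataclass
# class FairseqConfig:
#     common: CommonConfig = CommonConfig()

# Accepted on all versions: build a fresh CommonConfig per instance.
@dataclass
class FairseqConfig:
    common: CommonConfig = field(default_factory=CommonConfig)

print(FairseqConfig())  # FairseqConfig(common=CommonConfig(seed=1))
```

The traceback shows Python 3.11 (/opt/anaconda3/lib/python3.11), so the usual workarounds appear to be either running fairseq under Python 3.10, or patching the affected dataclass fields in fairseq/dataclass/configs.py to use `field(default_factory=...)` as sketched above.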