
RayEstimator needs a friendlier error message for memory errors #152

Open · 2 tasks
shanyu-sys opened this issue Aug 6, 2021 · 1 comment

shanyu-sys commented Aug 6, 2021

It would be better to guide users to tune their memory settings:

  • investigate the way that YARN reports the memory error
  • capture the error and show a more friendly message (a sketch follows this list)
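
A minimal sketch of the second item, assuming the error can be intercepted around the ray.get call that moves Spark XShards into Ray (the call that fails in the traceback in the comment below); the wrapper name and the message wording are illustrative, not the actual RayEstimator code:

import ray
from ray.exceptions import RayTaskError


def ray_get_with_memory_hint(object_refs):
    """ray.get wrapper that adds a memory-tuning hint when a remote task fails."""
    try:
        return ray.get(object_refs)
    except RayTaskError as e:
        # On YARN, a failed LocalStore actor may mean the executor hosting it was
        # killed for exceeding its memory limit; point the user to the YARN logs
        # and the relevant Spark setting instead of only the Ray-side error.
        raise RuntimeError(
            "A Ray task failed while transferring data to Ray. If you are running "
            "on YARN, the executor may have been killed for exceeding its memory "
            "limit; check the YARN logs and consider boosting "
            "spark.yarn.executor.memoryOverhead."
        ) from e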
shanyu-sys self-assigned this Aug 6, 2021
shanyu-sys (Contributor, Author) commented:

YARN reports the error below, which guides the user to increase spark.yarn.executor.memoryOverhead:
ERROR YarnScheduler:70 - Lost executor 1 on Almaren-Node-164: Container killed by YARN for exceeding memory limits. 160.2 GB of 160 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

However, it may take some time for the user to locate that message. The last error messages the user is likely to see are:

Traceback (most recent call last):
  File "train_wnd_tf2.py", line 258, in <module>
    label_cols=[column_info.label])
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/learn/tf2/estimator.py", line 231, in fit
    ray_xshards = process_spark_xshards(data, self.num_workers)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/utils.py", line 300, in process_spark_xshards
    ray_xshards = RayXShards.from_spark_xshards(data)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/ray_xshards.py", line 355, in from_spark_xshards
    return RayXShards._from_spark_xshards_ray_api(spark_xshards)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/ray_xshards.py", line 383, in _from_spark_xshards_ray_api
    ray.get([v.get_partitions.remote() for v in partition_stores.values()])
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::LocalStore.get_partitions() (pid=24448, ip=172.16.0.169)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
RuntimeError: The actor with name LocalStore failed to be imported, and so cannot execute this method.

And

(pid=175199, ip=172.16.0.114) 2021-07-28 13:27:48,099   ERROR function_manager.py:498 -- Failed to load actor class LocalStore.
(pid=175199, ip=172.16.0.114) Traceback (most recent call last):
(pid=175199, ip=172.16.0.114)   File "/disk4/yarn/nm/usercache/root/appcache/application_1626654036089_0548/container_1626654036089_0548_01_000008/python_env/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
(pid=175199, ip=172.16.0.114) ModuleNotFoundError: No module named 'zoo'
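
For reference, the change the YARN message asks for is a Spark configuration tweak; a minimal example of setting it before creating the Spark context (the 10g value is only a placeholder and should be tuned for the cluster):

from pyspark import SparkConf

# Reserve more off-heap memory per executor so YARN does not kill the container.
# On Spark 2.3+ the non-deprecated equivalent key is spark.executor.memoryOverhead.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "10g")

The same setting can be passed on the command line, e.g. spark-submit --conf spark.yarn.executor.memoryOverhead=10g.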

@liu-shaojun liu-shaojun transferred this issue from intel-analytics/BigDL-2.x Mar 5, 2024