
RayEstimator needs a friendlier error message for memory errors #152

Open · 2 tasks
shanyu-sys opened this issue Aug 6, 2021 · 1 comment

shanyu-sys commented Aug 6, 2021

It would be better to guide users to tune their memory settings:

  • investigate the way that YARN reports the memory error
  • capture the error and show a more friendly message (a sketch follows this list)
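
A minimal sketch of the second item, assuming the error can be intercepted around the ray.get call that moves Spark XShards into Ray (the call that fails in the traceback in the comment below); the wrapper name and the message wording are illustrative, not the actual RayEstimator code:

import ray
from ray.exceptions import RayTaskError


def ray_get_with_memory_hint(object_refs):
    """ray.get wrapper that adds a memory-tuning hint when a remote task fails."""
    try:
        return ray.get(object_refs)
    except RayTaskError as e:
        # On YARN, a failed LocalStore actor may mean the executor hosting it was
        # killed for exceeding its memory limit; point the user to the YARN logs
        # and the relevant Spark setting instead of only the Ray-side error.
        raise RuntimeError(
            "A Ray task failed while transferring data to Ray. If you are running "
            "on YARN, the executor may have been killed for exceeding its memory "
            "limit; check the YARN logs and consider boosting "
            "spark.yarn.executor.memoryOverhead."
        ) from e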
shanyu-sys self-assigned this Aug 6, 2021
shanyu-sys (Contributor, Author) commented:

YARN reports the error below, which guides the user to increase spark.yarn.executor.memoryOverhead:
ERROR YarnScheduler:70 - Lost executor 1 on Almaren-Node-164: Container killed by YARN for exceeding memory limits. 160.2 GB of 160 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

However, it may take some time for the user to locate that message. The last error messages the user is likely to see are:

Traceback (most recent call last):
  File "train_wnd_tf2.py", line 258, in <module>
    label_cols=[column_info.label])
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/learn/tf2/estimator.py", line 231, in fit
    ray_xshards = process_spark_xshards(data, self.num_workers)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/utils.py", line 300, in process_spark_xshards
    ray_xshards = RayXShards.from_spark_xshards(data)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/ray_xshards.py", line 355, in from_spark_xshards
    return RayXShards._from_spark_xshards_ray_api(spark_xshards)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/zoo/orca/data/ray_xshards.py", line 383, in _from_spark_xshards_ray_api
    ray.get([v.get_partitions.remote() for v in partition_stores.values()])
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/recsys-kai/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::LocalStore.get_partitions() (pid=24448, ip=172.16.0.169)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
RuntimeError: The actor with name LocalStore failed to be imported, and so cannot execute this method.

And

(pid=175199, ip=172.16.0.114) 2021-07-28 13:27:48,099   ERROR function_manager.py:498 -- Failed to load actor class LocalStore.
(pid=175199, ip=172.16.0.114) Traceback (most recent call last):
(pid=175199, ip=172.16.0.114)   File "/disk4/yarn/nm/usercache/root/appcache/application_1626654036089_0548/container_1626654036089_0548_01_000008/python_env/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
(pid=175199, ip=172.16.0.114) ModuleNotFoundError: No module named 'zoo'
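
For reference, the change the YARN message asks for is a Spark configuration tweak; a minimal example of setting it before creating the Spark context (the 10g value is only a placeholder and should be tuned for the cluster):

from pyspark import SparkConf

# Reserve more off-heap memory per executor so YARN does not kill the container.
# On Spark 2.3+ the non-deprecated equivalent key is spark.executor.memoryOverhead.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "10g")

The same setting can be passed on the command line, e.g. spark-submit --conf spark.yarn.executor.memoryOverhead=10g.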

@liu-shaojun liu-shaojun transferred this issue from intel-analytics/BigDL-2.x Mar 5, 2024