
update hyperzoo doc and k8s doc #3959

Merged 6 commits on May 20, 2021

Conversation

@Adria777 (Contributor) commented May 19, 2021

add k8s configuration to the readme of hyperzoo
add 7 notes on running analytics-zoo examples on k8s:

  1. users can use NFS to save local files
  2. users can set "temp-dir" to collect the worker logs
  3. users should specify "steps_per_epoch" and "validation_steps" when running with more than one executor
  4. increase the memory when "RayActorError" occurs
  5. "spark.kubernetes.container.image.pullPolicy" needs to be specified as "Always"
  6. when using NFS in a k8s cluster, the NFS options should be set for both the driver and the executor
  7. with more than one executor, remove `extra_params = {"temp-dir": "/tmp/ray/"}`
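To make note 3 concrete: with more than one executor, each worker only sees a shard of the dataset, so the epoch length must be derived from the global batch size. A minimal sketch (all sizes hypothetical; the exact `fit` parameter names should be checked against the Estimator API in use):

```python
import math

# Hypothetical dataset and batch sizes, for illustration only.
train_size = 60000   # number of training samples
val_size = 10000     # number of validation samples
batch_size = 320     # global batch size across all executors

# With a sharded dataset, the framework cannot infer the epoch length,
# so compute the steps explicitly from the global batch size.
steps_per_epoch = math.ceil(train_size / batch_size)
validation_steps = math.ceil(val_size / batch_size)

print(steps_per_epoch, validation_steps)  # 188 32

# These values would then be passed to the fit call, e.g.:
# est.fit(train_data, epochs=1,
#         steps_per_epoch=steps_per_epoch,
#         validation_data=val_data,
#         validation_steps=validation_steps)
```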

**Note**: k8s deletes the pod once a worker fails, in both client mode and cluster mode. If you want to keep the contents of the worker log, you can set a "temp-dir" to change the log directory. Please note that in this case you should set num-nodes to 1 if you use a network file system (NFS); otherwise it would cause an error, because the temp-dir and the NFS mount do not point to the same directory.

```python
init_orca_context(..., extra_params = {"temp-dir": "/tmp/ray/"})
```
Contributor commented:
we should also add a note that with more than 1 executor, `extra_params = {"temp-dir": "/tmp/ray/"}` should be removed, since conflicting writes will occur and cause a JSONDecodeError

@Adria777 changed the title from "update hyperzoo doc" to "update hyperzoo doc and k8s doc" on May 20, 2021
@glorysdj (Contributor) left a comment:
LGTM

@glorysdj glorysdj merged commit 28d5789 into intel-analytics:master May 20, 2021
@@ -125,6 +125,26 @@ init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>

Execute `python script.py` to run your program on k8s cluster directly.

**Note**: The k8s client and cluster modes do not support downloading files to local, logging callbacks, TensorBoard callbacks, etc. If you have these requirements, it is a good idea to use a network file system (NFS).
Collaborator commented:
It is not clear how "logging callback, tensorboard callback, etc." are related to NFS. And please specify how the user can use NFS in this case.

@glorysdj (Contributor) commented May 22, 2021:

yes, we should add a guide for how to mount a k8s PERSISTENT_VOLUME_CLAIM to the Spark executor and driver pods with these configs:

```bash
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/zoo \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/zoo
```

For logging and TensorBoard callbacks, if the outputs need to be persisted beyond the pod's lifecycle, users need to set the output dir to the mounted persistent volume dir. NFS is a simple example.
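The four `--conf` flags above can also be built programmatically. A sketch in Python, assuming `init_orca_context` accepts a `conf` dict of Spark properties, and using a hypothetical claim name `nfsvolumeclaim`:

```python
# Hypothetical PVC name and mount point; substitute your cluster's values.
claim = "nfsvolumeclaim"
mount_path = "/zoo"

# The same volume must be declared for both the driver and the executor pods.
conf = {}
for role in ("driver", "executor"):
    prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{claim}"
    conf[f"{prefix}.options.claimName"] = claim
    conf[f"{prefix}.mount.path"] = mount_path

# The dict would then be passed along, e.g.:
# init_orca_context(cluster_mode="k8s", master=..., conf=conf)
```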


**Note**: k8s deletes the pod once a worker fails, in both client mode and cluster mode. If you want to keep the contents of the worker log, you can set a "temp-dir" to change the log directory. Please note that in this case you should set num-nodes to 1 if you use a network file system (NFS); otherwise it would cause an error, because the temp-dir and the NFS mount do not point to the same directory.

Collaborator commented:

If this is a common issue for both client and cluster mode, you should put it outside of section 3.1.

And it is not clear:

  1. What value should "temp-dir" be? A local folder? An NFS folder?
  2. What do you mean by "if you use network file system (NFS)"?
  3. What do you mean by "the temp-dir and NFS are not point to the same directory"?

@glorysdj (Contributor) commented May 22, 2021:

this is a common issue for both client and cluster mode.
we need to add a section on how to debug Ray logs in k8s; we should try Spark local storage with the same temp-dir, or improve our Ray start to use a different temp-dir when a public storage is used.

```python
init_orca_context(..., extra_params = {"temp-dir": "/tmp/ray/"})
```

If you train with more than 1 executor and use NFS, please remove `extra_params = {"temp-dir": "/tmp/ray/"}`, because there would be a conflict if multiple executors write files to the same directory at the same time. It may cause a JSONDecodeError.
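The rule in this note can be sketched as a small guard (variable names hypothetical): only pass `temp-dir` when a single node writes under the shared directory:

```python
num_nodes = 2  # number of executors requested; illustrative value

extra_params = {}
if num_nodes == 1:
    # Safe: only one Ray session writes under this directory.
    extra_params["temp-dir"] = "/tmp/ray/"
# With more than one executor on shared storage (e.g. NFS), leave temp-dir
# unset so concurrent workers do not clobber each other's session files.

print(extra_params)  # {}
```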
Collaborator commented:

Is this the same as "Please note that in this case you should set num-nodes to 1 if you use network file system (NFS)"?

What do you mean by "If you training with more than 1 executor and use NFS"?

And does it mean that if training with more than 1 executor, there is no way for the user to get the logs?



**Note**: If you train with more than 1 executor, please make sure you set proper "steps_per_epoch" and "validation_steps".
Collaborator commented:

How to set proper "steps_per_epoch" and "validation steps"?



**Note**: "spark.kubernetes.container.image.pullPolicy" needs to be specified as "Always".
Collaborator commented:

Otherwise? Is it also needed for cluster mode?

And is there a way to set this automatically for the user?

Contributor commented:

it is a common setting for both client and cluster mode; we should move this to a common section. And the default value is "IfNotPresent", so it cannot be set automatically.
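A sketch of setting it explicitly (again assuming Spark properties can be passed through `init_orca_context`'s `conf` argument):

```python
conf = {
    # Spark's default is "IfNotPresent"; "Always" forces every node to
    # pull the latest image instead of reusing a stale cached one.
    "spark.kubernetes.container.image.pullPolicy": "Always",
}
# init_orca_context(cluster_mode="k8s", master=..., conf=conf)
print(conf["spark.kubernetes.container.image.pullPolicy"])  # Always
```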



**Note**: If "RayActorError" occurs, try to increase the memory.
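A hedged sketch of the mitigation; the sizes are hypothetical and the parameter names (`memory`, `driver_memory`) are assumptions to be checked against the actual `init_orca_context` signature:

```python
# Hypothetical sizes; tune for your workload until the error goes away.
resources = {
    "memory": "16g",         # per-executor memory
    "driver_memory": "4g",   # driver memory
}
# The dict would be expanded into the context call, e.g.:
# init_orca_context(cluster_mode="k8s", master=..., **resources)
print(resources["memory"])  # 16g
```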
@jason-dai (Collaborator) commented May 20, 2021:

RayActorError is a generic error; are there other specific error messages? And is it also needed for cluster mode?

@@ -151,6 +171,18 @@

```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
  file:///path/script.py
```

**Note**: You should specify the spark driver and spark executor when you use NFS
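One way to read this note: the NFS volume options must be declared under both the `spark.kubernetes.driver.*` and `spark.kubernetes.executor.*` namespaces. A sketch (assuming a Spark version that supports the `nfs` volume type; server, export path, and volume name are hypothetical):

```python
# Hypothetical NFS server and exported path; substitute your own.
nfs_server = "nfs.example.com"
nfs_path = "/export/zoo"

# Declaring the volume only for the driver (or only for the executor)
# is not enough; both namespaces need the same options.
conf = {}
for role in ("driver", "executor"):
    prefix = f"spark.kubernetes.{role}.volumes.nfs.nfsvolume"
    conf[f"{prefix}.options.server"] = nfs_server
    conf[f"{prefix}.options.path"] = nfs_path
    conf[f"{prefix}.mount.path"] = "/zoo"

print(len(conf))  # 6
```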
Collaborator commented:

This is to specify NFS options for both driver and executor, not "specify the spark driver and spark executor".

And it is not clear what you mean by "when you use NFS". Is it also needed for client mode?

@jason-dai (Collaborator) commented:

The updated document is not very clear from the user's point of view:

  1. There are too many notes, distributed across sections 3.1 and 3.2
  2. It is not clear what user scenarios these notes apply to; try explaining these user scenarios
  3. Maybe add a separate section for common topics or known issues

glorysdj added a commit that referenced this pull request Oct 14, 2021
* add hyperzoo for k8s support (#2140)

* add hyperzoo for k8s support

* format

* format

* format

* format

* run examples on k8s readme (#2163)

* k8s  readme

* fix jdk download issue (#2219)

* add doc for submit jupyter notebook and cluster serving to k8s (#2221)

* add hyperzoo doc

* add hyperzoo doc

* add hyperzoo doc

* add hyperzoo doc

* fix jdk download issue (#2223)

* bump to 0.9s (#2227)

* update jdk download url (#2259)

* update some previous docs (#2284)

* K8docsupdate (#2306)

* Update README.md

* Update s3 related links in readme  and documents (#2489)

* Update s3 related links in readme  and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* update

* update

* modify line length limit

* update

* Update mxnet-mkl version in hyper-zoo dockerfile (#2720)

Co-authored-by: gaoping <pingx.gao@intel.com>

* update bigdl version (#2743)

* update bigdl version

* hyperzoo dockerfile add cluster-serving (#2731)

* hyperzoo dockerfile add cluster-serving

* update

* update

* update

* update jdk url

* update jdk url

* update

Co-authored-by: gaoping <pingx.gao@intel.com>

* Support init_spark_on_k8s (#2813)

* initial

* fix

* code refactor

* bug fix

* update docker

* style

* add conda to docker image (#2894)

* add conda to docker image

* Update Dockerfile

* Update Dockerfile

Co-authored-by: glorysdj <glorysdj@gmail.com>

* Fix code blocks indents in .md files (#2978)

* Fix code blocks indents in .md files

Previously a lot of the code blocks in markdown files were horribly indented with bad white spaces in the beginning of lines. Users can't just select, copy, paste, and run (in the case of python). I have fixed all these, so there is no longer any code block with bad white space at the beginning of the lines.
It would be nice if you could try to make sure in future commits that all code blocks are properly indented inside and have the right amount of white space in the beginning!

* Fix small style issue

* Fix indents

* Fix indent and add \ for multiline commands

Change indent from 3 spaces to 4, and add "\" for multiline bash commands

Co-authored-by: Yifan Zhu <fanzhuyifan@gmail.com>

* enable bigdl 0.12 (#3101)

* switch to bigdl 0.12

* Hyperzoo example ref (#3143)

* specify pip version to fix oserror 0 of proxy (#3165)

* Bigdl0.12.1 (#3155)

* bigdl 0.12.1

* bump 0.10.0-Snapshot (#3237)

* update runtime image name (#3250)

* update jdk download url (#3316)

* update jdk8 url (#3411)

Co-authored-by: ardaci <dongjie.shi@intel.com>

* update hyperzoo docker image (#3429)

* update hyperzoo image (#3457)

* fix jdk in az docker (#3478)

* fix jdk in az docker

* fix jdk for hyperzoo

* fix jdk in jenkins docker

* fix jdk in cluster serving docker

* fix jdk

* fix readme

* update python dep to fit cnvrg (#3486)

* update ray version doc (#3568)

* fix deploy hyperzoo issue (#3574)

Co-authored-by: gaoping <pingx.gao@intel.com>

* add spark fix and net-tools and status check (#3742)

* intsall netstat and add check status

* add spark fix for graphene

* bigdl 0.12.2 (#3780)

* bump to 0.11-S and fix version issues except ipynb

* add multi-stage build Dockerfile (#3916)

* add multi-stage build Dockerfile

* multi-stage build dockerfile

* multi-stage build dockerfile

* Rename Dockerfile.multi to Dockerfile

* delete Dockerfile.multi

* remove comments, add TINI_VERSION to common arg, remove Dockerfile.multi

* multi-stage add tf_slim

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* update hyperzoo doc and k8s doc (#3959)

* update userguide of k8s

* update k8s guide

* update hyperzoo doc

* Update k8s.md

add note

* Update k8s.md

add note

* Update k8s.md

update notes

* fix 4087 issue (#4097)

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* fixed 4086 and 4083 issues (#4098)

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* Reduce image size (#4132)

* Reduce Dockerfile size
1. del redis stage
2. del flink stage
3. del conda & exclude some python packages
4. add copies layer stage

* update numpy version to 1.18.1

Co-authored-by: zzti-bsj <shaojiex.bai@intel.com>

* update hyperzoo image (#4250)

Co-authored-by: Adria777 <Adria777@github.com>

* bigdl 0.13 (#4210)

* bigdl 0.13

* update

* print exception

* pyspark2.4.6

* update release PyPI script

* update

* flip snapshot-0.12.0 and spark2.4.6 (#4254)

* s-0.12.0 master

* Update __init__.py

* Update python.md

* fix docker issues due to version update (#4280)

* fix docker issues

* fix docker issues

* update Dockerfile to support spark 3.1.2 && 2.4.6 (#4436)

Co-authored-by: shaojie <otnw_bsj@163.com>

* update hyperzoo, add lib for tf2 (#4614)

* delete tf 1.15.0 (#4719)

Co-authored-by: Le-Zheng <30695225+Le-Zheng@users.noreply.github.com>
Co-authored-by: pinggao18 <44043817+pinggao18@users.noreply.github.com>
Co-authored-by: pinggao187 <44044110+pinggao187@users.noreply.github.com>
Co-authored-by: gaoping <pingx.gao@intel.com>
Co-authored-by: Kai Huang <huangkaivision@gmail.com>
Co-authored-by: GavinGu07 <55721214+GavinGu07@users.noreply.github.com>
Co-authored-by: Yifan Zhu <zhuyifan@stanford.edu>
Co-authored-by: Yifan Zhu <fanzhuyifan@gmail.com>
Co-authored-by: Song Jiaming <litchy233@gmail.com>
Co-authored-by: ardaci <dongjie.shi@intel.com>
Co-authored-by: Yang Wang <yang3.wang@intel.com>
Co-authored-by: zzti-bsj <2779090360@qq.com>
Co-authored-by: shaojie <shaojiex.bai@intel.com>
Co-authored-by: Lingqi Su <33695124+Adria777@users.noreply.github.com>
Co-authored-by: Adria777 <Adria777@github.com>
Co-authored-by: shaojie <otnw_bsj@163.com>
3 participants