update hyperzoo doc and k8s doc #3959

Merged 6 commits on May 20, 2021. Changes shown are from 3 commits.

92 changes: 90 additions & 2 deletions docker/hyperzoo/README.md
@@ -55,7 +55,95 @@ Then pull the image. It will be faster.
```bash
sudo docker pull intelanalytics/hyper-zoo:latest
```

2. Launch a k8s client container:
2. K8s configuration

Get the k8s master URL to use as the Spark master:

```bash
kubectl cluster-info
```

After running this command, it shows "Kubernetes master is running at https://127.0.0.1:12345",

which means:

```bash
master="k8s://https://127.0.0.1:12345"
```

The namespace is `default`, or whatever you set via `spark.kubernetes.namespace`.
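
As a quick sanity check (a minimal sketch, not part of the original guide, assuming the `default` namespace):

```bash
# confirm the namespace exists; Spark jobs are submitted into it
# via --conf spark.kubernetes.namespace=default
kubectl get namespace default
```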

RBAC: create a `spark` service account and bind it to the `edit` cluster role:

```bash
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```
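
A small verification step (an addition for illustration, not from the original doc):

```bash
# confirm the service account and cluster role binding created above
kubectl get serviceaccount spark -n default
kubectl describe clusterrolebinding spark-role
```

At submit time this account is usually referenced with `--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark`.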

View the k8s configuration file:

```
.kube/config
```

or

```bash
kubectl config view --flatten --minify > kuberconfig
```
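
To confirm the exported file works on its own, a hedged check (reusing the `kuberconfig` file name produced above):

```bash
# point kubectl at the flattened config only and verify cluster connectivity
KUBECONFIG=$(pwd)/kuberconfig kubectl cluster-info
```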

The k8s data can be stored in NFS or Ceph; take NFS as an example.

On the NFS server, run:

```bash
yum install nfs-utils
systemctl enable rpcbind
systemctl enable nfs
systemctl start rpcbind
firewall-cmd --zone=public --permanent --add-service={rpc-bind,mountd,nfs}
firewall-cmd --reload
mkdir /disk1/nfsdata
chmod 755 /disk1/nfsdata
# edit /etc/exports to contain: /disk1/nfsdata *(rw,sync,no_root_squash,no_all_squash)
nano /etc/exports
systemctl restart nfs
```
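
A verification sketch (not in the original steps) to confirm the export is active on the server:

```bash
# re-export all entries from /etc/exports and list what the server publishes
exportfs -rav
showmount -e localhost
```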

On each NFS client, run:

```bash
yum install -y nfs-utils && systemctl start rpcbind && showmount -e <nfs-master-ip-address>
```
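
Optionally, a manual mount test from a client; this is illustrative only and assumes the `/disk1/nfsdata` export created above (the server address is a placeholder, as in the command above):

```bash
# mount the export temporarily, inspect it, then unmount
mkdir -p /mnt/nfstest
mount -t nfs <nfs-master-ip-address>:/disk1/nfsdata /mnt/nfstest
df -h /mnt/nfstest
umount /mnt/nfstest
```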

Configure k8s to use NFS (via the nfs-client provisioner):

```bash
git clone https://github.com/kubernetes-incubator/external-storage.git
cd /XXX/external-storage/nfs-client
nano deploy/deployment.yaml   # set the NFS server address and export path here
nano deploy/rbac.yaml         # adjust the namespace if needed
kubectl create -f deploy/rbac.yaml
kubectl create -f deploy/deployment.yaml
kubectl create -f deploy/class.yaml
```
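
A rough check that the provisioner came up; the pod label below is an assumption based on the upstream defaults, and the storage class name depends on your edited deploy/class.yaml:

```bash
# provisioner pod (label assumed) and the storage class it registers
kubectl get pods -l app=nfs-client-provisioner
kubectl get storageclass
```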

Test the provisioner:

```bash
kubectl create -f deploy/test-claim.yaml
kubectl create -f deploy/test-pod.yaml
kubectl get pvc
kubectl delete -f deploy/test-pod.yaml
kubectl delete -f deploy/test-claim.yaml
```

If the test succeeds, run:

```bash
kubectl create -f deploy/nfs-volume-claim.yaml
```

3. Launch a k8s client container:

Please note the two different containers: the **client container** is for the user to submit zoo jobs, since it contains all the required env and libs except hadoop/k8s configs; the executor container does not need to be created manually, as it is scheduled by k8s at runtime.
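
For illustration only, a minimal sketch of launching the client container; the full `docker run` command with all required environment variables is in the collapsed part of this README, and the mounted host paths below are assumptions:

```bash
# long-running client container with host networking and k8s configs mounted
sudo docker run -itd --net=host \
    -v /etc/kubernetes:/etc/kubernetes \
    -v /root/.kube:/root/.kube \
    intelanalytics/hyper-zoo:latest bash
```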

@@ -313,4 +401,4 @@ ${SPARK_HOME}/bin/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dbigdl.engineType=mklblas" \
--class com.intel.analytics.zoo.serving.ClusterServing \
local:/opt/analytics-zoo-0.8.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.8.0-SNAPSHOT-jar-with-dependencies.jar
```
```
32 changes: 31 additions & 1 deletion docs/readthedocs/source/doc/UserGuide/k8s.md
@@ -125,6 +125,24 @@ init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>

Execute `python script.py` to run your program on the k8s cluster directly.

**Note**: The k8s client and cluster modes do not support downloading files to local, logging callback, tensorboard callback, etc. If you have these requirements, it's a good idea to use a network file system (NFS).
Collaborator:
It is not clear how "logging callback, tensorboard callback, etc." are related to NFS. Please specify how the user can use NFS in this case.

Contributor @glorysdj (May 22, 2021):
Yes, we should add a guide for how to mount a k8s PERSISTENT_VOLUME_CLAIM to the Spark executor and driver pods with these configs:

--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/zoo \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/zoo

For logging and tensorboard callbacks, if the outputs need to be persisted beyond the pod's lifecycle, the user needs to set the output dir to the mounted persistent volume dir. NFS is a simple example.


**Note**: K8s will delete the pod once the worker fails, in both client mode and cluster mode. If you want to keep the contents of the worker log, you can set a "temp-dir" to change the log directory. Please note that in this case you should set num-nodes to 1 if you use a network file system (NFS); otherwise an error occurs because the temp-dir and NFS do not point to the same directory.

Collaborator:
If this is a common issue for both client and cluster mode, you should put it outside of section 3.1.

And it is not clear:

  1. What value should be used for "temp-dir"? A local folder? An NFS folder?
  2. What do you mean by "if you use network file system (NFS)"?
  3. What do you mean by "the temp-dir and NFS do not point to the same directory"?

Contributor @glorysdj (May 22, 2021):
This is a common issue for both client and cluster mode.
We need to add a section on how to debug Ray logs in k8s; we should try Spark local storage with the same temp-dir, or improve our Ray start with a different temp-dir when using a public storage.

```python
init_orca_context(..., extra_params = {"temp-dir": "/tmp/ray/"})
```

Contributor: We should also add a note that, with more than 1 executor, `extra_params = {"temp-dir": "/tmp/ray/"}` should be removed, since conflicting writes will occur and a JSONDecodeError will be raised.

**Note**: If you are training with more than 1 executor, please make sure you set proper "steps_per_epoch" and "validation steps".
Collaborator:
How to set proper "steps_per_epoch" and "validation steps"?


**Note**: "spark.kubernetes.container.image.pullPolicy" needs to be specified as "always"
Collaborator:
Otherwise? Is it also needed for cluster mode?

And is there a way to set this automatically for the user?

Contributor:
It is a common setting for both client and cluster mode; we should move this to a common section. The default value is "IfNotPresent", so it cannot be set automatically.


**Note**: if "RayActorError" occurs, try to increase the memory
Collaborator @jason-dai (May 20, 2021):
RayActorError is a generic error; are there other specific error messages? And is it also needed for cluster mode?


```python
init_orca_context(..., memory="10g", extra_executor_memory_for_ray="100g")
```
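
Before raising memory, a hedged diagnostic sketch (the pod name is a placeholder) to check whether an executor was OOM-killed or evicted:

```bash
# look for out-of-memory kills or evictions on a suspect executor pod
kubectl describe pod <executor-pod-name> | grep -iA2 "oom"
kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20
```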

#### **3.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
@@ -151,6 +169,18 @@ ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
file:///path/script.py
```

**Note**: You should specify the spark driver and spark executor when you use NFS
Collaborator:
This is to specify NFS options for both driver and executor, not "specify the spark driver and spark executor".

And it is not clear what you mean by "when you use NFS". Is it also needed for client mode?


```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
file:///path/script.py
```
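
As an optional check (the driver pod name is a placeholder; the claim name and `/zoo` mount path follow the configs above):

```bash
# confirm the claim is bound and the NFS-backed volume is visible in the driver pod
kubectl get pvc nfsvolumeclaim
kubectl exec -it <driver-pod-name> -- df -h /zoo
```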

#### **3.3 Run Jupyter Notebooks**

After a Docker container is launched and the user logs into the container, you can start the Jupyter Notebook service inside the container.
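
A generic sketch of starting the notebook server manually; the image may ship its own start script with preset options, and the notebook directory and port below are placeholders:

```bash
# start Jupyter inside the client container, listening on all interfaces
jupyter notebook --notebook-dir=/opt/work \
    --ip=0.0.0.0 --port=12345 --no-browser --allow-root
```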
@@ -244,4 +274,4 @@ Or clean up the entire spark application by pod label:

```bash
$ kubectl delete pod -l <pod label>
```
```