Create a getting started on K8s page #1932

viadea · 2021-03-14T23:59:25Z

No description provided.

tgravescs · 2021-03-15T14:16:39Z

I assume this is meant to replace https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#running-on-kubernetes ? If so, we probably need to remove that section.

tgravescs · 2021-03-22T15:13:58Z

OK, I think the doc is fine, but how does it fit in with the existing docs:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#running-on-kubernetes

Why don't we just remove that section here? Or were you going to do that in a follow-up? If so, the follow-up needs to be done right away so we don't forget about it.

tgravescs · 2021-03-22T15:14:05Z

build

btong04 · 2021-03-25T16:25:10Z

I was able to go through the documentation and create a K8s cluster, but it looks like the GPU operator plugin isn't working properly. I tried the DeepOps deployment several times (on a desktop) and also the alternative option of using kubeadm + Helm (on top of a new Ubuntu 18.04 install on AWS EC2). I had trouble getting Ansible to authenticate on EC2, so I went ahead with kubeadm + Helm.

Here are bug reports similar to what I encountered with the GPU operator plugin: NVIDIA/gpu-operator#83, NVIDIA/gpu-operator#166. Spark-shell can be started, but it complains about no resources being available when running GPU-enabled tasks. The Spark worker web UI on port 4040 confirms that no resources were assigned. Are there alternatives to using the GPU operator?

I was able to confirm that my Dockerfile works with a CPU-only Spark run on K8s prior to adding the GPU operator plugin.

viadea · 2021-03-29T22:29:04Z

As discussed, how to install a working K8s cluster with NVIDIA GPU support is outside the scope of this article.
You should follow https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm to install a working K8s cluster.

Regarding error messages such as: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default", you need to make sure you submit the Spark job using a service account that has the proper ClusterRole bound to it. For example, the Spark on Kubernetes doc https://spark.apache.org/docs/latest/running-on-kubernetes.html#submitting-applications-to-kubernetes shows how to create a service account named "spark" with the proper ClusterRole bound.

Of course, you can refer to the K8s docs for more cluster management tasks, such as how to create a service account, how to create a ClusterRoleBinding, how to assign resource quotas, etc.
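
For reference, the service account setup mentioned above can be sketched as follows. The two kubectl commands mirror the example in the Spark on Kubernetes documentation; the wrapper function name create_spark_service_account and the choice of the "default" namespace are assumptions made for illustration, so adjust them for your cluster.

```shell
#!/bin/sh
# Hypothetical helper (the function name is an assumption, not part of Spark or K8s).
# It creates a service account and binds the built-in "edit" ClusterRole to it,
# following the example in the Spark on Kubernetes documentation.
create_spark_service_account() {
    namespace="${1:-default}"   # assumed namespace; change it if you submit jobs elsewhere
    account="${2:-spark}"       # account name used in the Spark documentation

    kubectl create serviceaccount "$account" --namespace="$namespace" &&
    kubectl create clusterrolebinding "${account}-role" \
        --clusterrole=edit \
        --serviceaccount="${namespace}:${account}" \
        --namespace="$namespace"
}
```

After calling create_spark_service_account default spark, the job would be submitted with --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark so that the driver uses the new account instead of default:default.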

viadea · 2021-04-03T00:46:42Z

@btong04

First of all, the GPU operator is not required for testing this Spark feature.

Today I also tried using DeepOps to deploy a K8s cluster on a single EC2 machine, and it works fine for me as well.
Since DeepOps uses Kubespray as the K8s template, it may bring more complexity to K8s troubleshooting.

For example, when I tested installing K8s with DeepOps using all default settings,
I also hit several K8s-related issues. Here are the common ones and their resolutions:

1. There are 2 CoreDNS pods, with 1 pod pending.
   By default, a dns-autoscaler deployment is created in the K8s cluster, and it requires at least 2 CoreDNS pods.
   You can reduce the minimum from 2 to 1 with the command below:
   kubectl edit configmap dns-autoscaler --namespace=kube-system
   In the editor, change "min":2 to "min":1, and then delete the pending pod if it is still there or has started crashing.
2. The CoreDNS pod crashes with the reason "OOMKilled".
   By default, the CoreDNS pod has a 170MB memory limit, which may be too small for a big cluster.
   The fix is straightforward: increase the resource limits of the CoreDNS deployment:
   kubectl set resources deployment.v1.apps/coredns --namespace=kube-system --limits=cpu=1000m,memory=1024Mi
3. A Spark on Kubernetes job in client mode keeps failing.
   This is related to the Spark job, but it has nothing to do with the RAPIDS Accelerator.
   Any Spark job in client mode may fail if you install a K8s cluster on an EC2 instance using the DeepOps method.
   The symptoms are:
   - The Spark driver keeps printing the message below:
     Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
   - The Spark executor keeps crashing and restarting. Running "kubectl logs" on the executor pod shows the root cause:
     Caused by: java.net.UnknownHostException: ip-xxx-xxx-xxx-xxx.cluster.local

Here "hostname -f" returns a name ending in .cluster.local, while "hostname" returns a name ending in .ec2.internal.
This is strange, because the correct EC2 hostname should end in .ec2.internal.
A K8s cluster created by Kubespray enables the nodelocal DNS cache by default, with the default IP 169.254.25.10,
so the DNS server inside each pod always points to 169.254.25.10.

The root cause turned out to be the following entry, which Ansible added to /etc/hosts:

# Ansible inventory hosts BEGIN
xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.cluster.local ip-xxx-xxx-xxx-xxx ip-xxx-xxx-xxx-xxx.ec2.internal.cluster.local ip-xxx-xxx-xxx-xxx.ec2.internal

After removing the above entries from /etc/hosts, "hostname -f" returns the correct name.
Basically, we just let the DNS server resolve the hostname.
After that, the Spark job in client mode works fine.
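
The /etc/hosts cleanup described above could be scripted roughly as shown below. This is a sketch under assumptions: it assumes the Ansible-managed block is closed by a matching "# Ansible inventory hosts END" marker (only the BEGIN marker is shown above), and the function name strip_ansible_hosts_block is hypothetical. It prints the cleaned content to stdout instead of editing the file in place, so the result can be reviewed before it replaces /etc/hosts.

```shell
#!/bin/sh
# Hypothetical helper: print a hosts file with the Ansible-managed block removed.
# Assumption: the block ends with a "# Ansible inventory hosts END" line.
# If that END line is missing, sed deletes everything from BEGIN to the end of the file,
# so always compare the output with the original before using it.
strip_ansible_hosts_block() {
    hosts_file="${1:-/etc/hosts}"
    sed '/^# Ansible inventory hosts BEGIN/,/^# Ansible inventory hosts END/d' "$hosts_file"
}

# Possible usage on the node (run manually, after reviewing the diff):
#   strip_ansible_hosts_block /etc/hosts > /tmp/hosts.new
#   diff /etc/hosts /tmp/hosts.new
#   sudo cp /etc/hosts /etc/hosts.bak && sudo cp /tmp/hosts.new /etc/hosts
#   hostname -f
```

Note that rerunning the Ansible playbooks may add the block back, in which case the cleanup would have to be repeated.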

I am pretty sure there could be more K8s-related issues; this is just my quick summary of using DeepOps to install a K8s cluster on a single EC2 instance.
Installing an enterprise-ready K8s cluster with DeepOps creates more K8s deployments than the kubeadm method does, so if you just want to test the basic functions of Spark + RAPIDS on K8s and do not have much K8s troubleshooting experience, I would suggest using the kubeadm method.

* doing some test
* Revert "doing some test"
  This reverts commit f80cb6a. Rollback changes to README
* Update download.md
  Add Version Matrix in a spreadsheet.
* Update download.md
  Add Version Matrix in a spreadsheet. Signed-off-by: Hao Zhu <viadeazhu@gmail.com>
* Revert "Update download.md"
  This reverts commit 744bb42.
* Create getting-started-kubernetes.md
  This is a new getting-started-kubernetes.md with more examples and details for a quick-start.
* Fixed one typo
* Update docs/get-started/getting-started-kubernetes.md
  Fixed typo. Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* changed nav_order to 6
* Update docs/get-started/getting-started-kubernetes.md
  Fixed typo Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Update docs/get-started/getting-started-kubernetes.md
  re-word. Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Update docs/get-started/getting-started-kubernetes.md
  reword Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Update docs/get-started/getting-started-kubernetes.md
  fix typo Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* Update docs/get-started/getting-started-kubernetes.md
  reword Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* create "To delete the Driver POD" section
* Add a note. "This is a quick start guide which uses default settings which may be different from your cluster."
* add spark.kubernetes.memoryOverheadFactor=0.6
* Changed to spark.executor.memoryOverhead=3G
* Added a note to explain the jar location
* Update docs/get-started/getting-started-kubernetes.md
  reword Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* reword
* Update docs/get-started/getting-started-kubernetes.md
  reword Co-authored-by: Jason Lowe <jlowe@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

viadea and others added 10 commits March 5, 2021 21:28

f80cb6a  doing some test
47fdca4  Revert "doing some test"
         This reverts commit f80cb6a. Rollback changes to README
ede56bd  Update download.md
         Add Version Matrix in a spreadsheet.
744bb42  Update download.md
         Add Version Matrix in a spreadsheet. Signed-off-by: Hao Zhu <viadeazhu@gmail.com>
521921e  Merge branch 'branch-0.5' of github.com:viadea/spark-rapids into branch-0.5
4a82093  Revert "Update download.md"
         This reverts commit 744bb42.
ca4d040  Merge remote-tracking branch 'upstream/branch-0.5' into branch-0.5
e0ed7ad  Merge remote-tracking branch 'upstream/branch-0.5' into branch-0.5
5287650  Create getting-started-kubernetes.md
         This is a new getting-started-kubernetes.md with more examples and details for a quick-start.
2f366f9  Fixed one typo

jlowe changed the title ~~Create a getting started on K8s page.~~ Create a getting started on K8s page Mar 15, 2021

jlowe added the documentation label Mar 15, 2021

tgravescs reviewed Mar 15, 2021
