Create a getting started on K8s page #1932

Merged
merged 52 commits into from
Apr 7, 2021

Conversation

Collaborator

@viadea viadea commented Mar 14, 2021

No description provided.

viadea and others added 10 commits March 5, 2021 21:28
This reverts commit f80cb6a.

Rollback changes to README
Add Version Matrix in a spreadsheet.
Add Version Matrix in a spreadsheet.

Signed-off-by: Hao Zhu <viadeazhu@gmail.com>
This reverts commit 744bb42.
This is a new getting-started-kubernetes.md with more examples and details for a quick-start.
Fixed one typo
@jlowe jlowe changed the title Create a getting started on K8s page. Create a getting started on K8s page Mar 15, 2021
@jlowe jlowe added the documentation Improvements or additions to documentation label Mar 15, 2021
@tgravescs
Collaborator

I assume this is meant to replace https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#running-on-kubernetes ? If so, we probably need to remove that section.

viadea and others added 10 commits March 15, 2021 12:19
Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
viadea and others added 17 commits March 17, 2021 14:03
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@viadea viadea requested a review from tgravescs March 17, 2021 21:36
@tgravescs
Collaborator

OK, I think the doc is fine, but how does it fit in with the existing docs:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#running-on-kubernetes

Why don't we just remove that section here? Or were you going to do that in a follow-up? If so, the follow-up needs to be done right away so we don't forget about it.

@tgravescs
Collaborator

build

@btong04

btong04 commented Mar 25, 2021

I was able to go through the documentation and create a k8s cluster, but it looks like the GPU operator plugin isn't working properly. I tried the DeepOps deployment several times (on a desktop) as well as the alternative option of using kubeadm+helm (on top of a new Ubuntu 18.04 install on AWS EC2). I had trouble getting Ansible to authenticate on EC2, so I went ahead with kubeadm+helm.

Here are bug reports similar to what I encountered with the GPU operator plugin: NVIDIA/gpu-operator#83, NVIDIA/gpu-operator#166. Spark-shell can be started, but it complains about no resources being available when running GPU-enabled tasks. The Spark worker web UI on port 4040 confirms that no resources were assigned. Are there alternatives to using GPU-Operator?

I was able to confirm that my Dockerfile works with CPU-only Spark running on k8s prior to adding the GPU operator plugin.

@viadea
Collaborator Author

viadea commented Mar 29, 2021

As discussed, installing a working K8s cluster with NVIDIA GPU support is outside the scope of this article.
You should follow https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm to install a working K8s cluster.

Regarding error messages such as User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default", make sure you submit the Spark job using a service account that has the proper ClusterRole bound to it. For example, the Spark on Kubernetes documentation at https://spark.apache.org/docs/latest/running-on-kubernetes.html#submitting-applications-to-kubernetes shows how to create a service account named "spark" with the proper ClusterRole bound.

You can also refer to the K8s documentation for other cluster management tasks, such as how to create a service account, how to create a ClusterRoleBinding, and how to assign resource quotas.
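
For reference, the service account setup mentioned above can be sketched as a small shell helper. The account name "spark", the binding name "spark-role", and the "edit" ClusterRole follow the example in the Spark on Kubernetes documentation; the function name create_spark_service_account is hypothetical, and your cluster may need a different role or namespace.

```shell
#!/usr/bin/env bash
# Sketch only: create a service account that Spark can use to manage pods.
# "spark", "spark-role", and the "edit" ClusterRole come from the Spark on
# Kubernetes documentation; the function name itself is made up for this example.
create_spark_service_account() {
  local namespace="${1:-default}"

  # Service account that the Spark driver will run as.
  kubectl create serviceaccount spark --namespace="$namespace"

  # Bind the "edit" ClusterRole to that account so it can create and watch pods.
  kubectl create clusterrolebinding spark-role \
    --clusterrole=edit \
    --serviceaccount="${namespace}:spark" \
    --namespace="$namespace"
}

# Usage: create_spark_service_account default
```

After the account exists, pass --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark when submitting the job so that the driver uses it instead of the default service account.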

@viadea
Collaborator Author

viadea commented Apr 3, 2021

@btong04

First of all, the GPU Operator is not required for testing this Spark feature.

Today I also tried using DeepOps to deploy a K8s cluster on a single EC2 machine, and it worked fine for me as well.
Because DeepOps uses Kubespray as its K8s template, it may add complexity when troubleshooting K8s.

For example, when I tested installing K8s with DeepOps using all default settings,
I also ran into several K8s-related issues. Here are the common ones and their resolutions:

  1. There are 2 CoreDNS pods, and 1 of them is pending.
    By default, a dns-autoscaler deployment is created in the K8s cluster, and it requires at least 2 CoreDNS pods.
    You can reduce the minimum from 2 to 1 with the command below:
    kubectl edit configmap dns-autoscaler --namespace=kube-system
    In the editor, change "min":2 to "min":1, then delete the pending pod if it is still there or has started crashing.

  2. The CoreDNS pod crashed with the reason "OOMKilled".
    By default, the CoreDNS pod has a 170MB memory limit, which may be too small for a big cluster.
    The fix is straightforward: increase the resource limits of the CoreDNS deployment:
    kubectl set resources deployment.v1.apps/coredns --limits=cpu=1000m,memory=1024Mi

  3. A Spark on Kubernetes job in client mode keeps failing.
    This is related to the Spark job, but it has nothing to do with the RAPIDS Accelerator.
    Any Spark job in client mode may fail if you install the K8s cluster on an EC2 instance using the DeepOps method.
    The symptoms are:
    The Spark driver keeps printing the message below:
    Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
    The Spark executor keeps crashing and restarting, but running "kubectl logs" on the executor pod shows the root cause:
    Caused by: java.net.UnknownHostException: ip-xxx-xxx-xxx-xxx.cluster.local

Here, hostname -f returns a name ending in .cluster.local, while hostname returns a name ending in .ec2.internal.
This is strange, because the correct EC2 hostname should end in .ec2.internal.
The K8s cluster created by Kubespray enables the nodelocal DNS cache by default, with 169.254.25.10 as the default IP,
so the DNS server inside the pod always points to 169.254.25.10.

In the end, the root cause was the entry below, which Ansible added to /etc/hosts:

# Ansible inventory hosts BEGIN
xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.cluster.local ip-xxx-xxx-xxx-xxx ip-xxx-xxx-xxx-xxx.ec2.internal.cluster.local ip-xxx-xxx-xxx-xxx.ec2.internal

After removing the above entries from /etc/hosts, hostname -f returns the correct name.
Basically, we just let the DNS server resolve the hostname.
After that, Spark jobs in client mode work fine.
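
The /etc/hosts cleanup described above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name remove_cluster_local_entries is hypothetical, and it assumes that the only non-comment lines mentioning .cluster.local in the hosts file are the ones Ansible added.

```shell
#!/usr/bin/env bash
# Sketch only: drop hosts-file lines that map an address to a *.cluster.local name.
# Usage: remove_cluster_local_entries [FILE]   (FILE defaults to /etc/hosts)
# A backup of the original file is kept at FILE.bak.
remove_cluster_local_entries() {
  local hosts_file="${1:-/etc/hosts}"

  # Keep a copy so the change can be reverted.
  cp "$hosts_file" "${hosts_file}.bak"

  # Keep comment lines and every line that does not mention .cluster.local.
  awk '/^[ \t]*#/ || !/\.cluster\.local/' "${hosts_file}.bak" > "$hosts_file"
}
```

Run it as root against /etc/hosts, then compare the output of hostname and hostname -f again; both should now end in .ec2.internal.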

I am pretty sure there are more K8s-related issues; this is just my quick summary of using DeepOps to install a K8s cluster on a single EC2 instance.
The DeepOps method installs an enterprise-ready K8s cluster with more K8s deployments than the kubeadm method, so if you just want to test the basic functions of Spark + RAPIDS on K8s and don't have much K8s troubleshooting experience, I would suggest using the kubeadm method.
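
For issues 1 and 2 in the list above, a non-interactive variant might look like the sketch below. The function names are hypothetical. The sed expression assumes the configmap stores the setting literally as "min":2 (optionally with one space after the colon), and --namespace=kube-system is added on the assumption that your current context does not already default to that namespace.

```shell
#!/usr/bin/env bash
# Sketch only: scripted versions of the two CoreDNS fixes described above.

# Issue 1: lower the dns-autoscaler minimum replica count from 2 to 1
# without opening an editor.
set_dns_autoscaler_min_one() {
  kubectl get configmap dns-autoscaler --namespace=kube-system -o yaml \
    | sed -E 's/"min": ?2/"min":1/' \
    | kubectl apply -f -
}

# Issue 2: raise the CoreDNS resource limits (same values as in the comment).
raise_coredns_limits() {
  kubectl set resources deployment.v1.apps/coredns \
    --namespace=kube-system \
    --limits=cpu=1000m,memory=1024Mi
}
```

After lowering the minimum, the pending CoreDNS pod may still need to be deleted by hand, as noted in item 1.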

@jlowe jlowe merged commit 12a1b71 into NVIDIA:branch-0.5 Apr 7, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* doing some test

* Revert "doing some test"

This reverts commit f80cb6a.

Rollback changes to README

* Update download.md

Add Version Matrix in a spreadsheet.

* Update download.md

Add Version Matrix in a spreadsheet.

Signed-off-by: Hao Zhu <viadeazhu@gmail.com>

* Revert "Update download.md"

This reverts commit 744bb42.

* Create getting-started-kubernetes.md

This is a new getting-started-kubernetes.md with more examples and details for a quick-start.

* Fixed one typo

Fixed one typo

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* changed nav_order to 6

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

Fixed typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

re-word.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

fix typo

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* create "To delete the Driver POD" section

create "To delete the Driver POD" section

* Add a note.

"This is a quick start guide which uses default settings which may be different from your cluster."

* add spark.kubernetes.memoryOverheadFactor=0.6

add spark.kubernetes.memoryOverheadFactor=0.6

* Changed to spark.executor.memoryOverhead=3G 

Changed to spark.executor.memoryOverhead=3G

* Added a note to explain the jar location

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* reword

reword

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/get-started/getting-started-kubernetes.md

reword

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Labels
documentation Improvements or additions to documentation
5 participants