tests/smoke/aws: document manual execution, verification and troubleshooting

Partially fixes coreos#1038 (AWS only)
Sergiusz Urbaniak committed Jun 14, 2017
1 parent 2592040 commit 68feb5d
Showing 10 changed files with 194 additions and 10 deletions.
56 changes: 56 additions & 0 deletions Documentation/troubleshooting/etcd-nodes.md
@@ -0,0 +1,56 @@
# Troubleshooting etcd nodes using SSH

Tectonic etcd nodes are not assigned a public IP address; only the controller (master) nodes are. To debug an etcd node, SSH to it through a master node acting as a bastion host, or use a VPN connected to the internal network.

To do so, perform the following:

## Set up SSH agent forwarding

Once the passphrase of the local SSH key has been added to `ssh-agent`, you will not be prompted for credentials the next time you connect to nodes via SSH or SCP. The following instructions outline adding a passphrase-protected key to the `ssh-agent` on the system.

1. At the terminal, enter:

`$ eval "$(ssh-agent)"`

2. Run the following:

`$ ssh-add`

The `ssh-add` command prompts for a private key passphrase and adds it to the list maintained by `ssh-agent`.

3. Enter your private key passphrase.

4. Before logging out, run the following:

`$ kill $SSH_AGENT_PID`

To automatically run this command when logging out, place it in the `.logout` file if you are using csh or tcsh. Place the command in the `.bash_logout` file if you are using bash.
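
For example, a complete session might look like the following (a minimal sketch; the key path `~/.ssh/tectonic` is a hypothetical placeholder for the key used by your cluster):

```sh
# Start the agent and export its environment variables into this shell.
$ eval "$(ssh-agent)"
# Add the Tectonic key; you will be prompted for its passphrase.
$ ssh-add ~/.ssh/tectonic
# Work with the cluster, then stop the agent before logging out.
$ kill $SSH_AGENT_PID
```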

## Connect to a master node

SSH to a master node using its `EXTERNAL-IP`, passing the `-A` flag to forward the local `ssh-agent`. Add the `-i` option to specify the location of the SSH key known to Tectonic:

```bash
$ ssh -A core@10.0.23.37 -i /path/to/tectonic/cluster/ssh/key
```

## Get the IP address of the etcd nodes

Run the following command on the master instance:

```sh
core@ip-10-0-23-37 ~ $ grep etcd /opt/tectonic/manifests/kube-apiserver.yaml
- --etcd-servers=http://10.0.23.31:2379
```

## Connect to an etcd node

```sh
# From the master node
$ ssh core@10.0.23.31
```
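
Alternatively, recent OpenSSH clients (7.3 or later) can combine both hops into a single command with the `-J` (ProxyJump) option; a sketch using the example addresses above and assuming the key has already been added to the forwarded `ssh-agent`:

```sh
$ ssh -J core@10.0.23.37 core@10.0.23.31
```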

To inspect the `etcd-member` service logs, execute:
```sh
$ systemctl status etcd-member && journalctl -u etcd-member
```
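
From the etcd node, the member's health can also be checked against its local client endpoint. This is a sketch that assumes the plain-HTTP endpoint shown earlier and an `etcdctl` binary available on the node, which may not be the case on a minimal Container Linux image:

```sh
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
```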
53 changes: 53 additions & 0 deletions Documentation/troubleshooting/master-nodes.md
@@ -0,0 +1,53 @@
# Troubleshooting master nodes using SSH

Tectonic master nodes are usually assigned a public IP address. To debug a master node, SSH to it directly or use a VPN connected to the internal network.

View logs on the master node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.

If the cluster is deployed on AWS, check that the `init-assets` service started successfully and populated `/opt/tectonic`:
```sh
$ systemctl status init-assets && journalctl -u init-assets
$ ls /opt/tectonic
```

To examine the kubelet logs, execute:
```sh
$ systemctl status kubelet && journalctl -u kubelet
```

To examine the status and logs of the bootstrap and target control plane containers, execute:
```sh
$ docker ps -a | grep -v pause | grep apiserver
65faeddd2b78 quay.io/coreos/hyperkube@sha256:297f45919160ea076831cd067833ad3b64c789fcb3491016822e6f867d16dcd5 "/usr/bin/flock /var/" 13 minutes ago Up 13 minutes k8s_kube-apiserver_kube-apiserver-90pzs_kube-system_2983ff1c-510e-11e7-bc88-063d653969e3_0
$ docker logs 65faeddd2b78
```

The `bootkube` service is responsible for running the temporary bootstrap control plane and using it to bring up a vanilla Kubernetes control plane.
To examine the `bootkube` logs, execute:
```sh
$ journalctl -u bootkube
...
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.261765] bootkube[5]: Pod Status: pod-checkpointer Pending
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262217] bootkube[5]: Pod Status: kube-apiserver Running
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262518] bootkube[5]: Pod Status: kube-scheduler Pending
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262746] bootkube[5]: Pod Status: kube-controller-manager Pending
...
Jun 14 14:32:44 ip-10-0-23-37 bash[1313]: [ 284.264617] bootkube[5]: Pod Status: kube-controller-manager Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263245] bootkube[5]: Pod Status: pod-checkpointer Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263932] bootkube[5]: Pod Status: kube-apiserver Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.264715] bootkube[5]: Pod Status: kube-controller-manager Running
...
Jun 14 14:34:29 ip-10-0-23-37 bash[1313]: [ 389.299380] bootkube[5]: Tearing down temporary bootstrap control plane...
Jun 14 14:34:29 ip-10-0-23-37 systemd[1]: Started Bootstrap a Kubernetes cluster.
```
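
Once `bootkube` reports the control plane pods as `Running`, the API server can be queried directly from the master node. The following is a sketch that assumes a `kubectl` binary is available on the node and that the kubelet kubeconfig lives at `/etc/kubernetes/kubeconfig`; both are assumptions, not guarantees of the image contents:

```sh
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get nodes
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods --all-namespaces
```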

The `tectonic` service is responsible for installing the actual Tectonic assets on the bootstrapped vanilla cluster.
To examine the `tectonic` installation logs, execute:
```sh
$ journalctl -fu tectonic
Jun 14 14:36:22 ip-10-0-23-37 bash[4763]: [ 502.655337] hyperkube[5]: Pods not available yet, waiting for 5 seconds (10)
Jun 14 14:36:27 ip-10-0-23-37 bash[4763]: [ 507.955606] hyperkube[5]: Tectonic installation is done
Jun 14 14:36:28 ip-10-0-23-37 systemd[1]: Started Bootstrap a Tectonic cluster.
```
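
After the `tectonic` service reports completion, the Tectonic components themselves can be listed with the same assumed kubeconfig path and `kubectl` binary as above (the `tectonic-system` namespace is likewise an assumption):

```sh
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods -n tectonic-system
```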

[journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
6 changes: 5 additions & 1 deletion Documentation/troubleshooting/troubleshooting.md
@@ -3,7 +3,9 @@
This directory contains documents about troubleshooting Tectonic clusters.

* [Troubleshooting Tectonic Installer][installer-terraform] describes troubleshooting the installation process itself, including reapplying after a partially completed run, and errors related to missing Terraform modules and failed `tectonic.service` unit.
* [Troubleshooting worker nodes using SSH][worker-nodes] describes how to SSH into a master or worker node to troubleshoot at the host level.
* [Troubleshooting master nodes using SSH][master-nodes] describes how to SSH into a master node to troubleshoot at the host level.
* [Troubleshooting etcd nodes using SSH][etcd-nodes] describes how to SSH into an etcd node to troubleshoot at the host level.
* [Disaster recovery of Scheduler and Controller Manager pods][controller-recovery] describes how to recover a Kubernetes cluster from the failure of certain control plane components.
* [Etcd snapshot troubleshooting][etcd-backup-restore] explains how to spin up a local Kubernetes API server from a backup of another cluster's etcd state for troubleshooting.
* The [Tectonic FAQ][faq] answers some common questions about Tectonic versioning, licensing, and other general matters.
@@ -14,3 +16,5 @@ This directory contains documents about troubleshooting Tectonic clusters.
[faq]: faq.md
[installer-terraform]: installer-terraform.md
[worker-nodes]: worker-nodes.md
[master-nodes]: master-nodes.md
[etcd-nodes]: etcd-nodes.md
12 changes: 12 additions & 0 deletions Documentation/troubleshooting/worker-nodes.md
@@ -62,5 +62,17 @@ $ ssh core@192.0.2.18

View logs on the worker node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.

To examine the kubelet logs, execute:
```sh
# From worker node
$ journalctl -u kubelet
```

To examine the status and logs of potentially failed containers, execute:
```sh
$ docker ps -a | grep -v pause
...
$ docker logs ...
```
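
As a shortcut, Docker can filter on container state directly, which avoids grepping when hunting for crashed containers:

```sh
# List only containers that have exited, then inspect one of them.
$ docker ps -a --filter "status=exited"
$ docker logs <container-id>
```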

[journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
6 changes: 6 additions & 0 deletions installer/tests/sanity/README.md
@@ -20,3 +20,9 @@ Tests can then be run using `go test`.
go test -v -i github.com/coreos/tectonic-installer/installer/tests/sanity
go test -v github.com/coreos/tectonic-installer/installer/tests/sanity
```

To build the `sanity` binary for smoke tests, execute:
```sh
$ cd installer
$ make bin/sanity
```
63 changes: 60 additions & 3 deletions tests/smoke/aws/README.md
@@ -15,11 +15,31 @@ All Tectonic smoke tests on AWS are run using the `smoke.sh` script found in this directory.
./smoke.sh help
```

## Environment
To begin, verify that the following environment variables are set:

- `AWS_PROFILE`, or alternatively `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`: These credentials are used by Terraform to spawn clusters on AWS.
- `TF_VAR_tectonic_pull_secret_path` and `TF_VAR_tectonic_license_path`: The local paths to the pull secret and the Tectonic license file.
- `TF_VAR_tectonic_aws_ssh_key`: The name of the AWS SSH key pair that enables SSH access to the created machines as the `core` user. It must be present in AWS under "EC2 -> Network & Security -> Key Pairs".
- (optional) `BUILD_ID`: Any number >= 1. The region of the deployed cluster is selected based on this number; see the `REGIONS` variable in `smoke.sh` for details.
- (optional) `BRANCH_NAME`: The local branch name used as an infix for cluster names. A sensible value is the output of `git rev-parse --abbrev-ref HEAD`.

Example:
```sh
$ export AWS_ACCESS_KEY_ID=AKIAIQ5TVFGQ7CKWD6IA
$ export AWS_SECRET_ACCESS_KEY=rtp62V7H/JDY3cNBAs5vA0coaTou/OQbqJk96Hws
$ export TF_VAR_tectonic_license_path="/home/user/tectonic-license"
$ export TF_VAR_tectonic_pull_secret_path="/home/user/coreos-inc/pull-secret"
$ export TF_VAR_tectonic_aws_ssh_key="user"
```
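
Before launching a cluster, it can be worth confirming that the referenced key pair exists in the target account; a quick, optional check using the AWS CLI (add `--region` if your default region differs from the one the smoke test selects):

```sh
$ aws ec2 describe-key-pairs --key-names "$TF_VAR_tectonic_aws_ssh_key"
```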

## Assume Role
Smoke tests should be run with a role with limited privileges to ensure that the Tectonic Installer is as locked-down as possible.
The following steps create or update a role with restricted access and assume that role for testing purposes.
To begin, verify that the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables are set in the current shell.
Edit the trust policy located at `Documentation/files/aws-sts-trust-policy.json` to include the ARN for the AWS user to be used for testing.
Then, define a role name, e.g. `tectonic-installer`, and use the `smoke.sh` script to create a role or update an existing one:
```sh
export ROLE_NAME=tectonic-installer
@@ -32,13 +52,17 @@ Now that the role exists, assume the role:
source ./smoke.sh assume-role "$ROLE_NAME"
```

Note that this step can be skipped for local testing.
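
To confirm that the role was assumed successfully, inspect the current caller identity; the `Arn` in the output should reference the assumed role rather than the original IAM user:

```sh
$ aws sts get-caller-identity
```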

## Create and Test Cluster
Once the role has been assumed, select a Tectonic configuration to test, e.g. `vars/aws-exp.tfvars`, and plan a cluster:
```sh
export TEST_VARS=vars/aws-exp.tfvars
./smoke.sh plan $TEST_VARS
```

This creates a Terraform state directory in the project's top-level `build` directory, e.g. `build/aws-exp-master-1012345678901`.

Continue by actually creating the cluster:
```sh
./smoke.sh create $TEST_VARS
@@ -47,10 +71,43 @@ Finally, test the cluster:
Finally, test the cluster:
```sh
./smoke.sh test $TEST_VARS
=== RUN TestCluster
=== RUN TestCluster/APIAvailable
=== RUN TestCluster/AllNodesRunning
=== RUN TestCluster/GetLogs
=== RUN TestCluster/AllPodsRunning
=== RUN TestCluster/KillAPIServer
...
PASS
```

## Cleanup
Once all testing has concluded, clean up the AWS resources that were created:
```sh
./smoke.sh destroy $TEST_VARS
```

## Sanity test cheatsheet

To SSH into the created machines, determine the generated cluster name, then either use the [AWS CLI](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) to retrieve the public IP address or search for nodes carrying the cluster name in the AWS web console under "EC2 -> Instances":

```sh
$ ls build
aws-exp-master-1012345678901

$ export CLUSTER_NAME=aws-exp-master-1012345678901

$ aws autoscaling describe-auto-scaling-groups \
| jq -r '.AutoScalingGroups[] | select(.AutoScalingGroupName | contains("'${CLUSTER_NAME}'")) | .Instances[].InstanceId' \
| xargs aws ec2 describe-instances --instance-ids \
| jq '.Reservations[].Instances[] | select(.PublicIpAddress != null) | .PublicIpAddress'
"52.15.184.15"

$ ssh -A core@52.15.184.15
```

Once connected to the master node, follow the [troubleshooting guide](../../../Documentation/troubleshooting/troubleshooting.md) for master, worker, and etcd nodes to verify the following checklist:

- SSH connectivity to the master/worker/etcd nodes
- Successful start of all relevant installation service units on the corresponding nodes
- Successful login to the Tectonic Console
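
A minimal sketch for checking the service units, assuming the master's public IP obtained from the query above and that the listed unit names match the deployed services (`bootkube` and `tectonic` exist only on the master; each unit should report `Result=success`):

```sh
$ MASTER_IP=52.15.184.15  # hypothetical value, taken from the query above
$ for unit in kubelet bootkube tectonic; do ssh -A core@"$MASTER_IP" systemctl show -p Result "$unit"; done
```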
2 changes: 2 additions & 0 deletions tests/smoke/aws/smoke.sh
@@ -3,6 +3,8 @@ set -ex -o pipefail
shopt -s expand_aliases
DIR="$(cd "$(dirname "$0")" && pwd)"
WORKSPACE=${WORKSPACE:-"$(cd "$DIR"/../../.. && pwd)"}
BUILD_ID=${BUILD_ID:-1}
BRANCH_NAME=${BRANCH_NAME:-$(git rev-parse --abbrev-ref HEAD)}
# Alias filter for convenience
# shellcheck disable=SC2139
alias filter="$WORKSPACE"/installer/scripts/filter.sh
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws-ca.tfvars
@@ -18,8 +18,6 @@ tectonic_ca_cert = "../../examples/fake-creds/ca.crt"

tectonic_ca_key = "../../examples/fake-creds/ca.key"

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws-exp.tfvars
@@ -14,8 +14,6 @@ tectonic_ca_cert = ""

tectonic_ca_key = ""

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws.tfvars
@@ -18,8 +18,6 @@ tectonic_ca_cert = ""

tectonic_ca_key = ""

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
