tests/smoke/aws: document manual execution, verification and troubleshooting

Partially fixes coreos#1038 (AWS only)
Sergiusz Urbaniak committed Jun 14, 2017
1 parent 2592040 commit 68feb5d
Showing 10 changed files with 194 additions and 10 deletions.
56 changes: 56 additions & 0 deletions Documentation/troubleshooting/etcd-nodes.md
@@ -0,0 +1,56 @@
# Troubleshooting etcd nodes using SSH

Tectonic etcd nodes are not assigned a public IP address; only the controller (master) nodes are. To debug an etcd node, SSH to it through a master node acting as a bastion host, or use a VPN connected to the internal network.

To do so, perform the following:

## Set up SSH agent forwarding

Once the passphrase of the local SSH key has been added to `ssh-agent`, you will not be prompted for credentials the next time you connect to nodes via SSH or SCP. The following instructions outline adding a passphrase-protected key to the `ssh-agent` on the system.

1. At the terminal, enter:

`$ eval "$(ssh-agent)"`

2. Run the following:

`$ ssh-add`

The `ssh-add` command prompts for a private key passphrase and adds it to the list maintained by `ssh-agent`.

3. Enter your private key passphrase.

4. Before logging out, run the following:

`$ kill $SSH_AGENT_PID`

To automatically run this command when logging out, place it in the `.logout` file if you are using csh or tcsh. Place the command in the `.bash_logout` file if you are using bash.
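
For example, a complete session might look like the following (a minimal sketch; the key path `~/.ssh/tectonic` is a hypothetical placeholder for the key used by your cluster):

```sh
# Start the agent and export its environment variables into this shell.
$ eval "$(ssh-agent)"
# Add the Tectonic key; you will be prompted for its passphrase.
$ ssh-add ~/.ssh/tectonic
# Work with the cluster, then stop the agent before logging out.
$ kill $SSH_AGENT_PID
```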

## Connect to a master node

SSH to a master node using its `EXTERNAL-IP`, passing the `-A` flag to forward the local `ssh-agent`. Add the `-i` option to specify the location of the SSH key known to Tectonic:

```bash
$ ssh -A core@10.0.23.37 -i /path/to/tectonic/cluster/ssh/key
```

## Get the IP address of the etcd nodes

Run the following command on the master instance:

```sh
core@ip-10-0-23-37 ~ $ grep etcd /opt/tectonic/manifests/kube-apiserver.yaml
- --etcd-servers=http://10.0.23.31:2379
```

## Connect to an etcd node

```sh
# From the master node
$ ssh core@10.0.23.31
```
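
Alternatively, recent OpenSSH clients (7.3 or later) can combine both hops into a single command with the `-J` (ProxyJump) option; a sketch using the example addresses above and assuming the key has already been added to the forwarded `ssh-agent`:

```sh
$ ssh -J core@10.0.23.37 core@10.0.23.31
```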

To inspect the `etcd-member` service logs, execute:
```sh
$ systemctl status etcd-member && journalctl -u etcd-member
```
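
From the etcd node, the member's health can also be checked against its local client endpoint. This is a sketch that assumes the plain-HTTP endpoint shown earlier and an `etcdctl` binary available on the node, which may not be the case on a minimal Container Linux image:

```sh
$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
```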
53 changes: 53 additions & 0 deletions Documentation/troubleshooting/master-nodes.md
@@ -0,0 +1,53 @@
# Troubleshooting master nodes using SSH

Tectonic master nodes are usually assigned a public IP address. To debug a master node, SSH to it directly or use a VPN connected to the internal network.

View logs on the master node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.

If the cluster is deployed on AWS, check that the `init-assets` service started successfully and populated `/opt/tectonic`:
```sh
$ systemctl status init-assets && journalctl -u init-assets
$ ls /opt/tectonic
```

To examine the kubelet logs, execute:
```sh
$ systemctl status kubelet && journalctl -u kubelet
```

To examine the status and logs of the bootstrap and target control plane containers, execute:
```sh
$ docker ps -a | grep -v pause | grep apiserver
65faeddd2b78 quay.io/coreos/hyperkube@sha256:297f45919160ea076831cd067833ad3b64c789fcb3491016822e6f867d16dcd5 "/usr/bin/flock /var/" 13 minutes ago Up 13 minutes k8s_kube-apiserver_kube-apiserver-90pzs_kube-system_2983ff1c-510e-11e7-bc88-063d653969e3_0
$ docker logs 65faeddd2b78
```

The `bootkube` service is responsible for running the temporary bootstrap control plane and using it to bring up a vanilla Kubernetes control plane.
To examine the `bootkube` logs, execute:
```sh
$ journalctl -u bootkube
...
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.261765] bootkube[5]: Pod Status: pod-checkpointer Pending
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262217] bootkube[5]: Pod Status: kube-apiserver Running
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262518] bootkube[5]: Pod Status: kube-scheduler Pending
Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262746] bootkube[5]: Pod Status: kube-controller-manager Pending
...
Jun 14 14:32:44 ip-10-0-23-37 bash[1313]: [ 284.264617] bootkube[5]: Pod Status: kube-controller-manager Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263245] bootkube[5]: Pod Status: pod-checkpointer Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263932] bootkube[5]: Pod Status: kube-apiserver Running
Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.264715] bootkube[5]: Pod Status: kube-controller-manager Running
...
Jun 14 14:34:29 ip-10-0-23-37 bash[1313]: [ 389.299380] bootkube[5]: Tearing down temporary bootstrap control plane...
Jun 14 14:34:29 ip-10-0-23-37 systemd[1]: Started Bootstrap a Kubernetes cluster.
```
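
Once `bootkube` reports the control plane pods as `Running`, the API server can be queried directly from the master node. The following is a sketch that assumes a `kubectl` binary is available on the node and that the kubelet kubeconfig lives at `/etc/kubernetes/kubeconfig`; both are assumptions, not guarantees of the image contents:

```sh
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get nodes
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods --all-namespaces
```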

The `tectonic` service is responsible for installing the actual Tectonic assets on the bootstrapped vanilla cluster.
To examine the `tectonic` installation logs, execute:
```sh
$ journalctl -fu tectonic
Jun 14 14:36:22 ip-10-0-23-37 bash[4763]: [ 502.655337] hyperkube[5]: Pods not available yet, waiting for 5 seconds (10)
Jun 14 14:36:27 ip-10-0-23-37 bash[4763]: [ 507.955606] hyperkube[5]: Tectonic installation is done
Jun 14 14:36:28 ip-10-0-23-37 systemd[1]: Started Bootstrap a Tectonic cluster.
```
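
After the `tectonic` service reports completion, the Tectonic components themselves can be listed with the same assumed kubeconfig path and `kubectl` binary as above (the `tectonic-system` namespace is likewise an assumption):

```sh
$ kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods -n tectonic-system
```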

[journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
6 changes: 5 additions & 1 deletion Documentation/troubleshooting/troubleshooting.md
@@ -3,7 +3,9 @@
This directory contains documents about troubleshooting Tectonic clusters.

* [Troubleshooting Tectonic Installer][installer-terraform] describes troubleshooting the installation process itself, including reapplying after a partially completed run, and errors related to missing Terraform modules and failed `tectonic.service` unit.
* [Troubleshooting worker nodes using SSH][worker-nodes] describes how to SSH into a master or worker node to troubleshoot at the host level.
* [Troubleshooting master nodes using SSH][master-nodes] describes how to SSH into a master node to troubleshoot at the host level.
* [Troubleshooting etcd nodes using SSH][etcd-nodes] describes how to SSH into an etcd node to troubleshoot at the host level.
* [Disaster recovery of Scheduler and Controller Manager pods][controller-recovery] describes how to recover a Kubernetes cluster from the failure of certain control plane components.
* [Etcd snapshot troubleshooting][etcd-backup-restore] explains how to spin up a local Kubernetes API server from a backup of another cluster's etcd state for troubleshooting.
* The [Tectonic FAQ][faq] answers some common questions about Tectonic versioning, licensing, and other general matters.
@@ -14,3 +16,5 @@ This directory contains documents about troubleshooting Tectonic clusters.
[faq]: faq.md
[installer-terraform]: installer-terraform.md
[worker-nodes]: worker-nodes.md
[master-nodes]: master-nodes.md
[etcd-nodes]: etcd-nodes.md
12 changes: 12 additions & 0 deletions Documentation/troubleshooting/worker-nodes.md
@@ -62,5 +62,17 @@ $ ssh core@192.0.2.18

View logs on the worker node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.

To examine the kubelet logs, execute:
```sh
# From worker node
$ journalctl -u kubelet
```

To examine the status and logs of potentially failed containers, execute:
```sh
$ docker ps -a | grep -v pause
...
$ docker logs ...
```
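
As a shortcut, Docker can filter on container state directly, which avoids grepping when hunting for crashed containers:

```sh
# List only containers that have exited, then inspect one of them.
$ docker ps -a --filter "status=exited"
$ docker logs <container-id>
```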

[journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
6 changes: 6 additions & 0 deletions installer/tests/sanity/README.md
@@ -20,3 +20,9 @@ Tests can then be run using `go test`.
go test -v -i github.com/coreos/tectonic-installer/installer/tests/sanity
go test -v github.com/coreos/tectonic-installer/installer/tests/sanity
```

To build the `sanity` binary for smoke tests, execute:
```sh
$ cd installer
$ make bin/sanity
```
63 changes: 60 additions & 3 deletions tests/smoke/aws/README.md
@@ -15,11 +15,31 @@ All Tectonic smoke tests on AWS are run using the `smoke.sh` script found in this directory.
./smoke.sh help
```

## Environment
To begin, verify that the following environment variables are set:

- `AWS_PROFILE`, or alternatively `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`: These credentials are used by Terraform to spawn clusters on AWS.
- `TF_VAR_tectonic_pull_secret_path` and `TF_VAR_tectonic_license_path`: The local paths to the pull secret and the Tectonic license file.
- `TF_VAR_tectonic_aws_ssh_key`: The name of the AWS SSH key pair that enables SSH access to the created machines as the `core` user. It must be present in AWS under "EC2 -> Network & Security -> Key Pairs".
- (optional) `BUILD_ID`: Any number >= 1. The region of the deployed cluster is selected based on this number; see the `REGIONS` variable in `smoke.sh` for details.
- (optional) `BRANCH_NAME`: The local branch name used as an infix for cluster names. A sensible value is the output of `git rev-parse --abbrev-ref HEAD`.

Example:
```sh
$ export AWS_ACCESS_KEY_ID=AKIAIQ5TVFGQ7CKWD6IA
$ export AWS_SECRET_ACCESS_KEY=rtp62V7H/JDY3cNBAs5vA0coaTou/OQbqJk96Hws
$ export TF_VAR_tectonic_license_path="/home/user/tectonic-license"
$ export TF_VAR_tectonic_pull_secret_path="/home/user/coreos-inc/pull-secret"
$ export TF_VAR_tectonic_aws_ssh_key="user"
```
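
Before launching a cluster, it can be worth confirming that the referenced key pair exists in the target account; a quick, optional check using the AWS CLI (add `--region` if your default region differs from the one the smoke test selects):

```sh
$ aws ec2 describe-key-pairs --key-names "$TF_VAR_tectonic_aws_ssh_key"
```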

## Assume Role
Smoke tests should be run with a role with limited privileges to ensure that the Tectonic Installer is as locked-down as possible.
The following steps create or update a role with restricted access and assume that role for testing purposes.
To begin, verify that the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables are set in the current shell.
Edit the trust policy located at `Documentation/files/aws-sts-trust-policy.json` to include the ARN for the AWS user to be used for testing.
Then, define a role name, e.g. `tectonic-installer`, and use the `smoke.sh` script to create a role or update an existing one:
```sh
export ROLE_NAME=tectonic-installer
@@ -32,13 +52,17 @@ Now that the role exists, assume the role:
source ./smoke.sh assume-role "$ROLE_NAME"
```

Note that this step can be skipped for local testing.
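
To confirm that the role was assumed successfully, inspect the current caller identity; the `Arn` in the output should reference the assumed role rather than the original IAM user:

```sh
$ aws sts get-caller-identity
```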

## Create and Test Cluster
Once the role has been assumed, select a Tectonic configuration to test, e.g. `vars/aws-exp.tfvars`, and plan a cluster:
```sh
export TEST_VARS=vars/aws-exp.tfvars
./smoke.sh plan $TEST_VARS
```

This creates a Terraform state directory in the project's top-level `build` directory, e.g. `build/aws-exp-master-1012345678901`.

Continue by actually creating the cluster:
```sh
./smoke.sh create $TEST_VARS
@@ -47,10 +71,43 @@ Finally, test the cluster:
Finally, test the cluster:
```sh
./smoke.sh test $TEST_VARS
=== RUN TestCluster
=== RUN TestCluster/APIAvailable
=== RUN TestCluster/AllNodesRunning
=== RUN TestCluster/GetLogs
=== RUN TestCluster/AllPodsRunning
=== RUN TestCluster/KillAPIServer
...
PASS
```

## Cleanup
Once all testing has concluded, clean up the AWS resources that were created:
```sh
./smoke.sh destroy $TEST_VARS
```

## Sanity test cheatsheet

To SSH into the created machines, determine the generated cluster name, then either use the [AWS CLI](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) to retrieve the public IP address or search for nodes carrying the cluster name in the AWS web console under "EC2 -> Instances":

```sh
$ ls build
aws-exp-master-1012345678901

$ export CLUSTER_NAME=aws-exp-master-1012345678901

$ aws autoscaling describe-auto-scaling-groups \
| jq -r '.AutoScalingGroups[] | select(.AutoScalingGroupName | contains("'${CLUSTER_NAME}'")) | .Instances[].InstanceId' \
| xargs aws ec2 describe-instances --instance-ids \
| jq '.Reservations[].Instances[] | select(.PublicIpAddress != null) | .PublicIpAddress'
"52.15.184.15"

$ ssh -A core@52.15.184.15
```

Once connected to the master node, follow the [troubleshooting guide](../../../Documentation/troubleshooting/troubleshooting.md) for master, worker, and etcd nodes to verify the following checklist:

- SSH connectivity to the master/worker/etcd nodes
- Successful start of all relevant installation service units on the corresponding nodes
- Successful login to the Tectonic Console
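
A minimal sketch for checking the service units, assuming the master's public IP obtained from the query above and that the listed unit names match the deployed services (`bootkube` and `tectonic` exist only on the master; each unit should report `Result=success`):

```sh
$ MASTER_IP=52.15.184.15  # hypothetical value, taken from the query above
$ for unit in kubelet bootkube tectonic; do ssh -A core@"$MASTER_IP" systemctl show -p Result "$unit"; done
```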
2 changes: 2 additions & 0 deletions tests/smoke/aws/smoke.sh
@@ -3,6 +3,8 @@ set -ex -o pipefail
shopt -s expand_aliases
DIR="$(cd "$(dirname "$0")" && pwd)"
WORKSPACE=${WORKSPACE:-"$(cd "$DIR"/../../.. && pwd)"}
BUILD_ID=${BUILD_ID:-1}
BRANCH_NAME=${BRANCH_NAME:-$(git rev-parse --abbrev-ref HEAD)}
# Alias filter for convenience
# shellcheck disable=SC2139
alias filter="$WORKSPACE"/installer/scripts/filter.sh
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws-ca.tfvars
@@ -18,8 +18,6 @@ tectonic_ca_cert = "../../examples/fake-creds/ca.crt"

tectonic_ca_key = "../../examples/fake-creds/ca.key"

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws-exp.tfvars
@@ -14,8 +14,6 @@ tectonic_ca_cert = ""

tectonic_ca_key = ""

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
2 changes: 0 additions & 2 deletions tests/smoke/aws/vars/aws.tfvars
@@ -18,8 +18,6 @@ tectonic_ca_cert = ""

tectonic_ca_key = ""

tectonic_aws_ssh_key = "jenkins"

tectonic_aws_master_ec2_type = "m4.large"

tectonic_aws_worker_ec2_type = "m4.large"
