diff --git a/Documentation/troubleshooting/etcd-nodes.md b/Documentation/troubleshooting/etcd-nodes.md
new file mode 100644
index 0000000000..d40c6bb46f
--- /dev/null
+++ b/Documentation/troubleshooting/etcd-nodes.md
@@ -0,0 +1,56 @@
+# Troubleshooting etcd nodes using SSH
+
+Tectonic etcd nodes are not assigned a public IP address; only the master nodes are. To debug an etcd node, SSH to it through a master node (bastion host) or use a VPN connected to the internal network.
+
+To do so, perform the following:
+
+## Set up SSH agent forwarding
+
+Once the passphrase of the local SSH key is added to `ssh-agent`, you will not be prompted for it the next time you connect to nodes via SSH or SCP. The following instructions outline how to add the passphrase to `ssh-agent` on the system.
+
+ 1. At the terminal, enter:
+
+    `$ eval $(ssh-agent)`
+
+ 2. Run the following:
+
+    `$ ssh-add`
+
+    The `ssh-add` command prompts for a private key passphrase and adds it to the list maintained by `ssh-agent`.
+
+ 3. Enter your private key passphrase.
+
+ 4. Before logging out, run the following:
+
+    `$ kill $SSH_AGENT_PID`
+
+    To run this command automatically when logging out, place it in the `.logout` file if you are using csh or tcsh, or in the `.bash_logout` file if you are using bash.
+
+## Connect to a master node
+
+SSH to a master node with its `EXTERNAL-IP`, providing the `-A` flag to forward the local `ssh-agent`. Add the `-i` option to specify the location of the SSH key known to Tectonic:
+
+```bash
+$ ssh -A core@10.0.23.37 -i /path/to/tectonic/cluster/ssh/key
+```
+
+## Get the IP address of the etcd nodes
+
+Run the following command on the master instance:
+
+```sh
+core@ip-10-0-23-37 ~ $ grep etcd /opt/tectonic/manifests/kube-apiserver.yaml
+    - --etcd-servers=http://10.0.23.31:2379
+```
+
+## Connect to an etcd node
+
+```sh
+# From the master node
+$ ssh core@10.0.23.31
+```
+
+To inspect the `etcd-member` service logs, execute:
+```sh
+$ systemctl status etcd-member && journalctl -u etcd-member
+```
diff --git a/Documentation/troubleshooting/master-nodes.md b/Documentation/troubleshooting/master-nodes.md
new file mode 100644
index 0000000000..ed03a0b9ce
--- /dev/null
+++ b/Documentation/troubleshooting/master-nodes.md
@@ -0,0 +1,53 @@
+# Troubleshooting master nodes using SSH
+
+Tectonic master nodes are usually assigned a public IP address. To debug a master node, SSH to it directly or use a VPN connected to the internal network.
+
+View logs on the master node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.
+
+If the cluster is deployed on AWS, check whether the `init-assets` service started successfully and placed the cluster assets under `/opt/tectonic`:
+```sh
+$ systemctl status init-assets && journalctl -u init-assets
+$ ls /opt/tectonic
+```
+
+To examine the kubelet logs, execute:
+```sh
+$ systemctl status kubelet && journalctl -u kubelet
+```
+
+To examine the status and logs of the bootstrap and target control plane containers, execute:
+```sh
+$ docker ps -a | grep -v pause | grep apiserver
+65faeddd2b78 quay.io/coreos/hyperkube@sha256:297f45919160ea076831cd067833ad3b64c789fcb3491016822e6f867d16dcd5 "/usr/bin/flock /var/" 13 minutes ago Up 13 minutes k8s_kube-apiserver_kube-apiserver-90pzs_kube-system_2983ff1c-510e-11e7-bc88-063d653969e3_0
+$ docker logs 65faeddd2b78
+```
+
+The `bootkube` service is responsible for starting the temporary bootstrap control plane and using it to bring up a vanilla Kubernetes control plane.
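+
+Because `bootkube` runs as a systemd unit (note the `systemd[1]: Started Bootstrap a Kubernetes cluster.` line in the log excerpt below), a quick way to check whether this bootstrap step has finished is to query the unit state, in the same way as the other services in this guide:
+```sh
+$ systemctl status bootkube
+```
+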
+To examine the `bootkube` logs, execute:
+```sh
+$ journalctl -u bootkube
+...
+Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.261765] bootkube[5]: Pod Status: pod-checkpointer Pending
+Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262217] bootkube[5]: Pod Status: kube-apiserver Running
+Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262518] bootkube[5]: Pod Status: kube-scheduler Pending
+Jun 14 14:31:39 ip-10-0-23-37 bash[1313]: [ 219.262746] bootkube[5]: Pod Status: kube-controller-manager Pending
+...
+Jun 14 14:32:44 ip-10-0-23-37 bash[1313]: [ 284.264617] bootkube[5]: Pod Status: kube-controller-manager Running
+Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263245] bootkube[5]: Pod Status: pod-checkpointer Running
+Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.263932] bootkube[5]: Pod Status: kube-apiserver Running
+Jun 14 14:32:49 ip-10-0-23-37 bash[1313]: [ 289.264715] bootkube[5]: Pod Status: kube-controller-manager Running
+...
+Jun 14 14:34:29 ip-10-0-23-37 bash[1313]: [ 389.299380] bootkube[5]: Tearing down temporary bootstrap control plane...
+Jun 14 14:34:29 ip-10-0-23-37 systemd[1]: Started Bootstrap a Kubernetes cluster.
+```
+
+The `tectonic` service is responsible for installing the actual Tectonic assets on the bootstrapped vanilla cluster.
+To examine the `tectonic` installation logs, execute:
+```sh
+$ journalctl -fu tectonic
+Jun 14 14:36:22 ip-10-0-23-37 bash[4763]: [ 502.655337] hyperkube[5]: Pods not available yet, waiting for 5 seconds (10)
+Jun 14 14:36:27 ip-10-0-23-37 bash[4763]: [ 507.955606] hyperkube[5]: Tectonic installation is done
+Jun 14 14:36:28 ip-10-0-23-37 systemd[1]: Started Bootstrap a Tectonic cluster.
+```
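+
+Once `tectonic` reports that the installation is done, the `docker ps` filter shown above can be reused to confirm that the self-hosted control plane containers are running on the node (a quick sketch; exact container names may vary between versions):
+```sh
+$ docker ps | grep -v pause | grep -E 'kube-(apiserver|scheduler|controller-manager)'
+```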
+
+[journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
diff --git a/Documentation/troubleshooting/troubleshooting.md b/Documentation/troubleshooting/troubleshooting.md
index e23466e31d..5bd67beedc 100644
--- a/Documentation/troubleshooting/troubleshooting.md
+++ b/Documentation/troubleshooting/troubleshooting.md
@@ -3,7 +3,9 @@ This directory contains documents about troubleshooting Tectonic clusters.
 
 * [Troubleshooting Tectonic Installer][installer-terraform] describes troubleshooting the installation process itself, including reapplying after a partially completed run, and errors related to missing Terraform modules and failed `tectonic.service` unit.
-* [Troubleshooting worker nodes using SSH][worker-nodes] describes how to SSH into a controller or worker node to troubleshoot at the host level.
+* [Troubleshooting worker nodes using SSH][worker-nodes] describes how to SSH into a master or worker node to troubleshoot at the host level.
+* [Troubleshooting master nodes using SSH][master-nodes] describes how to SSH into a master node to troubleshoot at the host level.
+* [Troubleshooting etcd nodes using SSH][etcd-nodes] describes how to SSH into an etcd node to troubleshoot at the host level.
 * [Disaster recovery of Scheduler and Controller Manager pods][controller-recovery] describes how to recover a Kubernetes cluster from the failure of certain control plane components.
 * [Etcd snapshot troubleshooting][etcd-backup-restore] explains how to spin up a local Kubernetes API server from a backup of another cluster's etcd state for troubleshooting.
 * The [Tectonic FAQ][faq] answers some common questions about Tectonic versioning, licensing, and other general matters.
 
@@ -14,3 +16,5 @@ This directory contains documents about troubleshooting Tectonic clusters.
 [faq]: faq.md
 [installer-terraform]: installer-terraform.md
 [worker-nodes]: worker-nodes.md
+[master-nodes]: master-nodes.md
+[etcd-nodes]: etcd-nodes.md
diff --git a/Documentation/troubleshooting/worker-nodes.md b/Documentation/troubleshooting/worker-nodes.md
index 971c10f7e3..f2bd79b3e0 100644
--- a/Documentation/troubleshooting/worker-nodes.md
+++ b/Documentation/troubleshooting/worker-nodes.md
@@ -62,5 +62,17 @@ $ ssh core@192.0.2.18
 
 View logs on the worker node by using `journalctl -xe` or similar tools. [Reading the system log][journalctl] has more information.
 
+To examine the kubelet logs, execute:
+```sh
+# From the worker node
+$ journalctl -u kubelet
+```
+
+To examine the status and logs of potentially failed containers, execute:
+```sh
+$ docker ps -a | grep -v pause
+...
+$ docker logs ...
+```
 
 [journalctl]: https://github.com/coreos/docs/blob/master/os/reading-the-system-log.md
diff --git a/installer/tests/sanity/README.md b/installer/tests/sanity/README.md
index 619a62cabc..5f98bfcbc5 100644
--- a/installer/tests/sanity/README.md
+++ b/installer/tests/sanity/README.md
@@ -20,3 +20,9 @@ Tests can then be run using `go test`.
 go test -v -i github.com/coreos/tectonic-installer/installer/tests/sanity
 go test -v github.com/coreos/tectonic-installer/installer/tests/sanity
 ```
+
+To build the `sanity` binary for smoke tests, execute:
+```sh
+$ cd installer
+$ make bin/sanity
+```
diff --git a/tests/smoke/aws/README.md b/tests/smoke/aws/README.md
index ab25c62b77..9472a3da8d 100644
--- a/tests/smoke/aws/README.md
+++ b/tests/smoke/aws/README.md
@@ -15,11 +15,31 @@ All Tectonic smoke tests on AWS are run using the `smoke.sh` script found in thi
 ./smoke.sh help
 ```
 
+## Environment
+To begin, verify that the following environment variables are set:
+
+- `AWS_PROFILE`, or alternatively `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`: The credentials used by Terraform to spawn clusters on AWS.
+- `TF_VAR_tectonic_pull_secret_path` and `TF_VAR_tectonic_license_path`: The local paths to the pull secret and the Tectonic license file.
+- `TF_VAR_tectonic_aws_ssh_key`: The name of the AWS SSH key pair that allows SSH access to the created machines as the `core` user.
+  It must be present in AWS under "EC2 -> Network & Security -> Key Pairs".
+- (optional) `BUILD_ID`: Any number >= 1. The region of the deployed cluster is selected based on this number.
+  See the `REGIONS` variable in `smoke.sh` for details.
+- (optional) `BRANCH_NAME`: The local branch name used as an infix for cluster names.
+  A sensible value is `git rev-parse --abbrev-ref HEAD`.
+
+Example:
+```
+$ export AWS_ACCESS_KEY_ID=AKIAIQ5TVFGQ7CKWD6IA
+$ export AWS_SECRET_ACCESS_KEY=rtp62V7H/JDY3cNBAs5vA0coaTou/OQbqJk96Hws
+$ export TF_VAR_tectonic_license_path="/home/user/tectonic-license"
+$ export TF_VAR_tectonic_pull_secret_path="/home/user/coreos-inc/pull-secret"
+$ export TF_VAR_tectonic_aws_ssh_key="user"
+```
+
 ## Assume Role
 Smoke tests should be run with a role with limited privileges to ensure that the Tectonic Installer is as locked-down as possible.
 The following steps create or update a role with restricted access and assume that role for testing purposes.
-To begin, verify that the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables are set in the current shell.
-Now, edit the trust policy located at `Documentation/files/aws-sts-trust-policy.json` to include the ARN for the AWS user to be used for testing.
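+Before creating or assuming the role, it can be useful to confirm which credentials the AWS CLI currently uses; `aws sts get-caller-identity` (standard AWS CLI, not part of `smoke.sh`) also prints the user ARN referenced by the trust policy below:
+```sh
+$ aws sts get-caller-identity
+```
+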
+Edit the trust policy located at `Documentation/files/aws-sts-trust-policy.json` to include the ARN for the AWS user to be used for testing.
 Then, define a role name, e.g. `tectonic-installer`, and use the `smoke.sh` script to create a role or update an existing one:
 ```sh
 export ROLE_NAME=tectonic-installer
@@ -32,13 +52,17 @@ Now that the role exists, assume the role:
 ```sh
 source ./smoke.sh assume-role "$ROLE_NAME"
 ```
+Note that this step can be skipped for local testing.
+
 ## Create and Test Cluster
-Once the role has been assumed, select a Tectonic configuration to test, e.g. `aws-exp.tfvars` and plan a cluster:
+Once the role has been assumed, select a Tectonic configuration to test, e.g. `vars/aws-exp.tfvars`, and plan a cluster:
 ```sh
 export TEST_VARS=vars/aws-exp.tfvars
 ./smoke.sh plan $TEST_VARS
 ```
+This will create a Terraform state directory in the project's top-level `build` directory, e.g. `build/aws-exp-master-1012345678901`.
+
 Continue by actually creating the cluster:
 ```sh
 ./smoke.sh create $TEST_VARS
 ```
@@ -47,6 +71,14 @@ Continue by actually creating the cluster:
 Finally, test the cluster:
 ```sh
 ./smoke.sh test $TEST_VARS
+=== RUN TestCluster
+=== RUN TestCluster/APIAvailable
+=== RUN TestCluster/AllNodesRunning
+=== RUN TestCluster/GetLogs
+=== RUN TestCluster/AllPodsRunning
+=== RUN TestCluster/KillAPIServer
+...
+PASS
 ```
 
 ## Cleanup
@@ -54,3 +86,28 @@ Once all testing has concluded, clean up the AWS resources that were created:
 ```sh
 ./smoke.sh destroy $TEST_VARS
 ```
+
+## Sanity test cheatsheet
+
+To be able to SSH into the created machines, determine the generated cluster name and use the [AWS CLI](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) to retrieve the public IP address, or search for nodes with the cluster name via the AWS Web UI under "EC2 -> Instances":
+
+```sh
+$ ls build
+aws-exp-master-1012345678901
+
+$ export CLUSTER_NAME=aws-exp-master-1012345678901
+
+$ aws autoscaling describe-auto-scaling-groups \
+  | jq -r '.AutoScalingGroups[] | select(.AutoScalingGroupName | contains("'${CLUSTER_NAME}'")) | .Instances[].InstanceId' \
+  | xargs aws ec2 describe-instances --instance-ids \
+  | jq '.Reservations[].Instances[] | select(.PublicIpAddress != null) | .PublicIpAddress'
+"52.15.184.15"
+
+$ ssh -A core@52.15.184.15
+```
+
+Once connected to the master node, follow the [troubleshooting guide](../../../Documentation/troubleshooting/troubleshooting.md) for master, worker, and etcd nodes to investigate the following checklist:
+
+- SSH connectivity to the master/worker/etcd nodes
+- Successful start of all relevant installation service units on the corresponding nodes
+- Successful login to the Tectonic Console
diff --git a/tests/smoke/aws/smoke.sh b/tests/smoke/aws/smoke.sh
index ffeb5ef238..1b6b93276f 100755
--- a/tests/smoke/aws/smoke.sh
+++ b/tests/smoke/aws/smoke.sh
@@ -3,6 +3,8 @@ set -ex -o pipefail
 shopt -s expand_aliases
 DIR="$(cd "$(dirname "$0")" && pwd)"
&& pwd)"} +BUILD_ID=${BUILD_ID:-1} +BRANCH_NAME=${BRANCH_NAME:-$(git rev-parse --abbrev-ref HEAD)} # Alias filter for convenience # shellcheck disable=SC2139 alias filter="$WORKSPACE"/installer/scripts/filter.sh diff --git a/tests/smoke/aws/vars/aws-ca.tfvars b/tests/smoke/aws/vars/aws-ca.tfvars index 3851bd31c5..9ce6dbd24b 100644 --- a/tests/smoke/aws/vars/aws-ca.tfvars +++ b/tests/smoke/aws/vars/aws-ca.tfvars @@ -18,8 +18,6 @@ tectonic_ca_cert = "../../examples/fake-creds/ca.crt" tectonic_ca_key = "../../examples/fake-creds/ca.key" -tectonic_aws_ssh_key = "jenkins" - tectonic_aws_master_ec2_type = "m4.large" tectonic_aws_worker_ec2_type = "m4.large" diff --git a/tests/smoke/aws/vars/aws-exp.tfvars b/tests/smoke/aws/vars/aws-exp.tfvars index d9003e7603..6700910bbc 100644 --- a/tests/smoke/aws/vars/aws-exp.tfvars +++ b/tests/smoke/aws/vars/aws-exp.tfvars @@ -14,8 +14,6 @@ tectonic_ca_cert = "" tectonic_ca_key = "" -tectonic_aws_ssh_key = "jenkins" - tectonic_aws_master_ec2_type = "m4.large" tectonic_aws_worker_ec2_type = "m4.large" diff --git a/tests/smoke/aws/vars/aws.tfvars b/tests/smoke/aws/vars/aws.tfvars index 8b33dc7ff7..c585bcbb41 100644 --- a/tests/smoke/aws/vars/aws.tfvars +++ b/tests/smoke/aws/vars/aws.tfvars @@ -18,8 +18,6 @@ tectonic_ca_cert = "" tectonic_ca_key = "" -tectonic_aws_ssh_key = "jenkins" - tectonic_aws_master_ec2_type = "m4.large" tectonic_aws_worker_ec2_type = "m4.large"