Terraform destroy: retry policy #1017
Comments
This is high priority: P0. We need to drop everything and fix this ASAP.
I investigated the problem today and here is what happens. The destruction fails at detaching the internet gateway. The only dependent object on the above is Unfortunately the error response from AWS indicates that there are still public IP addresses assigned to that gateway, responding with an HTTP response code 400 and the following error message:
According to the error message, there is also an internal AWS relationship to the Elastic IPs. Looking at the deletion timestamps for the
So why does detaching Since there is no direct relationship towards Eventually
This prevents the internet gateway from being deleted before the Elastic IPs, which can cause destroy failure. Fixes coreos#1017
I verified that putting an explicit dependency from
…#1053)
* modules/aws: add dep from aws_eip.nat_eip to aws_internet_gateway.igw
  This prevents the internet gateway from being deleted before the Elastic IPs, which can cause destroy failure. Fixes #1017
* modules/aws: remove explicit dep from nat_gw to nat_eip
  It is expressed in allocation_id already and can cause the graph optimization to be distorted.
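To illustrate why the added dependency changes the destroy order, here is a minimal sketch that models the resource graph in Python. It is not how Terraform is implemented; it only relies on the fact that destroy walks the dependency graph in reverse, so dependents go first. The resource address `aws_nat_gateway.nat_gw` is an assumption (the commit only says `nat_gw`), and the graph is a hypothetical subset of the module.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each resource maps to the set of resources it depends on.
# Hypothetical subset of the module; "aws_nat_gateway.nat_gw" is an assumed address.
depends_on = {
    "aws_internet_gateway.igw": set(),
    # Explicit dependency added by the commit above.
    "aws_eip.nat_eip": {"aws_internet_gateway.igw"},
    # Implicit dependency, expressed through allocation_id.
    "aws_nat_gateway.nat_gw": {"aws_eip.nat_eip"},
}


def destroy_order(graph):
    """Return resources in destroy order: dependents first, dependencies last."""
    create_order = list(TopologicalSorter(graph).static_order())
    return list(reversed(create_order))


order = destroy_order(depends_on)
print(order)
# ['aws_nat_gateway.nat_gw', 'aws_eip.nat_eip', 'aws_internet_gateway.igw']
```

With the explicit edge in place, the Elastic IP is always released before the internet gateway is detached. Without it, the two resources are unordered relative to each other and the gateway may be processed first.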
The issue happened again in a few PRs, including https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1074/2/pipeline. Hence, re-opening :(
This is very unfortunate :-( I could reproduce (with a patched Terraform version that has max-retries set to 2) that we might still get 400 responses from AWS even if the internet gateway is destroyed after the EIPs:
The only countermeasure here now that comes to my mind is to make the
Closing here as we have a few retries now. We should track resolution in #1054. SPC engineers are working on implementing the timeout lifecycle on the resource.
This increases the not found checks to 90, which, delayed by 10 seconds, match the total timeout of 15 minutes. Fixes coreos/tectonic-installer#1017
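The arithmetic in that commit message can be checked quickly. This is only a sanity check of the numbers quoted above; the constant names are made up for illustration and are not the provider's identifiers.

```python
from datetime import timedelta

# Values taken from the commit message; the names are hypothetical.
NOT_FOUND_CHECKS = 90
DELAY_BETWEEN_CHECKS = timedelta(seconds=10)

total = NOT_FOUND_CHECKS * DELAY_BETWEEN_CHECKS
print(total)  # 0:15:00
```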
upstream PR: hashicorp/terraform-provider-aws#1021
@s-urbaniak Can we close here as hashicorp/terraform-provider-aws#1021 was merged?
@mxinden definitely, thanks for the catch!
We have been running more and more into:
We currently destroy once as part of the smoke test.
If this fails (like here), we mark the test as failed and try to destroy 3 more times (see Jenkinsfile).
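For reference, the current policy as described above could be sketched roughly like this. It is a simplified model, not the actual Jenkinsfile logic: `destroy_with_retries` and its return shape are hypothetical, and `destroy` stands in for whatever actually invokes `terraform destroy`.

```python
def destroy_with_retries(destroy, extra_retries=3):
    """Model of the policy: one destroy in the smoke test, then up to 3 retries.

    `destroy` is any callable returning True on success (hypothetical stand-in
    for a real `terraform destroy` run).
    Returns (test_passed, cleaned_up, attempts).
    """
    attempts = 1
    if destroy():
        return True, True, attempts
    # The first destroy failed, so the test is marked failed
    # no matter how the retries go.
    for _ in range(extra_retries):
        attempts += 1
        if destroy():
            return False, True, attempts
    return False, False, attempts


# Fake destroy that fails twice and then succeeds.
outcomes = iter([False, False, True])
result = destroy_with_retries(lambda: next(outcomes))
print(result)  # (False, True, 3)
```

Note that in this model a transient failure on the first attempt fails the whole test even when a retry cleans everything up, which is the behavior this issue asks to reconsider.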
While the real fix belongs in the code base or upstream, would you like to reconsider the policy a bit?
My suggestion would be to:
-refresh=false
-refresh=true
and TF_LOG=TRACE