This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

Terraform destroy: retry policy #1017

Closed
Quentin-M opened this issue Jun 7, 2017 · 9 comments

Comments

@Quentin-M
Contributor

Quentin-M commented Jun 7, 2017

We have been running more and more into:

Error applying plan:

1 error(s) occurred:

* module.vpc.aws_internet_gateway.igw (destroy): 1 error(s) occurred:

* aws_internet_gateway.igw: Error waiting for internet gateway (igw-648d5203) to detach: couldn't find resource (31 retries)

We currently destroy once as part of the smoke test.
If this fails (as it did here), we mark the test as failed and try to destroy three more times (see Jenkinsfile).

While the real fix belongs in the code base or upstream, would you like to reconsider the policy a bit?

My suggestion would be to:

  • Still try to destroy once.
  • Still fail the test if that destroy fails.
  • Destroy a second time normally.
  • Destroy with -refresh=false.
  • Destroy with -refresh=true and TF_LOG=TRACE.
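
The escalation above could be sketched roughly as follows. This is only an illustration of the proposed policy, not the Jenkinsfile logic: the function names are hypothetical, and the `-force` flag is an assumption about how the destroy is run non-interactively.

```python
import os
import subprocess
from typing import Callable, Dict, List, Tuple

# One attempt is a command line plus extra environment variables.
Attempt = Tuple[List[str], Dict[str, str]]


def destroy_attempts() -> List[Attempt]:
    """Return the escalating destroy attempts in the proposed order."""
    # Assumption: "-force" is used so that destroy does not prompt.
    base = ["terraform", "destroy", "-force"]
    return [
        (base, {}),                                       # first destroy (decides the test result)
        (base, {}),                                       # second destroy, run normally
        (base + ["-refresh=false"], {}),                  # skip the state refresh
        (base + ["-refresh=true"], {"TF_LOG": "TRACE"}),  # refresh again, with trace logs
    ]


def run_attempt(cmd: List[str], extra_env: Dict[str, str]) -> bool:
    """Run one attempt for real and report whether it exited with status 0."""
    env = dict(os.environ, **extra_env)
    return subprocess.run(cmd, env=env).returncode == 0


def destroy_with_policy(
    run: Callable[[List[str], Dict[str, str]], bool] = run_attempt,
) -> Tuple[bool, bool]:
    """Return (test_passed, cleaned_up).

    The test passes only if the very first destroy succeeds; the later
    attempts exist purely to clean up the leftover resources.
    """
    for index, (cmd, extra_env) in enumerate(destroy_attempts()):
        if run(cmd, extra_env):
            return index == 0, True
    return False, False
```

Passing a fake `run` callable makes it possible to exercise the policy without touching any infrastructure.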
@sym3tri
Contributor

sym3tri commented Jun 9, 2017

This is high priority (P0). We need to drop everything and fix this ASAP.

@s-urbaniak
Contributor

I investigated the problem today; here is what happens. The destroy fails while detaching the internet gateway modules.aws.vpc.aws_internet_gateway.igw.

The only object that depends on it is module.aws.vpc.aws_route.igw_route, so Terraform destroys the route first and the internet gateway afterwards, which is correct from a DAG traversal perspective.

Unfortunately, the error response from AWS indicates that there are still public IP addresses mapped in the gateway's VPC. AWS responds with HTTP status code 400 and the following error message:

2017/06/12 14:28:39 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:28:39 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DetachInternetGateway Details:
...
2017/06/12 14:28:39 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/12 14:28:39 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/12 14:28:39 [DEBUG] plugin: terraform: HTTP/1.1 400 Bad Request
2017/06/12 14:28:39 [DEBUG] plugin: terraform: <Response><Errors><Error><Code>DependencyViolation</Code><Message>Network vpc-bbe35bd2 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.</Message></Error></Errors><RequestID>b4a0b4bc-d130-4889-8680-a560bce90be3</RequestID></Response>

According to the error message, AWS internally also maintains a relationship between the gateway and the Elastic IPs modules.aws.vpc.aws_eip.nat_eip. Terraform, however, has no DAG edge between them, so it deletes those resources concurrently.
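
To make that reading of the response concrete, here is a small sketch that pulls the error code and message out of an EC2 error body like the one logged above. The function names are hypothetical, and treating `DependencyViolation` as the "retry later" signal is an assumption made for illustration, not a statement about how the provider classifies errors.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Error body copied from the DetachInternetGateway response in the log above.
SAMPLE_BODY = (
    "<Response><Errors><Error><Code>DependencyViolation</Code>"
    "<Message>Network vpc-bbe35bd2 has some mapped public address(es). "
    "Please unmap those public address(es) before detaching the gateway."
    "</Message></Error></Errors>"
    "<RequestID>b4a0b4bc-d130-4889-8680-a560bce90be3</RequestID></Response>"
)


def parse_ec2_error(body: str) -> Optional[Tuple[str, str]]:
    """Return (code, message) of the first error in the body, or None."""
    root = ET.fromstring(body)
    error = root.find("./Errors/Error")
    if error is None:
        return None
    return error.findtext("Code", default=""), error.findtext("Message", default="")


def blocked_by_dependency(body: str) -> bool:
    """True if the request failed because another resource is still in the way."""
    parsed = parse_ec2_error(body)
    return parsed is not None and parsed[0] == "DependencyViolation"
```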

Looking at the deletion timestamps for the aws_eip.nat_eip resources, one can see that detaching the internet gateway aws_internet_gateway.igw succeeds immediately after they are destroyed:

...
module.vpc.aws_eip.nat_eip.0: Destruction complete
2017/06/12 14:30:45 [TRACE] Preserving existing state lineage "0c9f710e-b766-4983-a38a-0b7c8258dd0c"
...
module.vpc.aws_eip.nat_eip.1: Destruction complete
2017/06/12 14:30:45 [DEBUG] root.vpc: eval: *terraform.EvalUpdateStateHook
...
2017/06/12 14:30:45 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:30:45 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DetachInternetGateway Details:
...
14:30:45 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/12 14:30:45 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/12 14:30:45 [DEBUG] plugin: terraform: HTTP/1.1 200 OK
2017/06/12 14:30:45 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:30:45 [DEBUG] Waiting for state to become: [success]

So why does detaching aws_internet_gateway.igw work most of the time and fail only sometimes?

Since there is no direct relationship to aws_eip.nat_eip, Terraform simply retries detaching aws_internet_gateway.igw up to 30 times, with a 10-second delay between retries. That gives a total timeout of 5 minutes for destroying the internet gateway; see https://github.com/hashicorp/terraform/blob/v0.9.6/builtin/providers/aws/resource_aws_internet_gateway.go#L243-L244.

Eventually aws_eip.nat_eip gets destroyed, which allows aws_internet_gateway.igw to be detached successfully. But if destroying aws_eip.nat_eip takes more than 5 minutes from the moment Terraform starts destroying aws_internet_gateway.igw, the whole destroy process fails with the above error message.
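
The arithmetic behind that race can be illustrated with a simplified model. This is not the provider's code: it assumes one initial attempt followed by `max_retries` retries spaced `delay` seconds apart, and it assumes a detach attempt succeeds as soon as the Elastic IPs have been released.

```python
from typing import Optional


def retry_budget_seconds(max_retries: int, delay: int = 10) -> int:
    """Total time covered by the retries, e.g. 30 * 10 s = 300 s = 5 minutes."""
    return max_retries * delay


def detach_gateway(
    eip_released_after: float, max_retries: int = 30, delay: int = 10
) -> Optional[int]:
    """Return the elapsed seconds at which detaching succeeds, or None on timeout.

    eip_released_after is the (simulated) number of seconds it takes until
    the Elastic IPs are gone and AWS accepts the detach request.
    """
    elapsed = 0
    for _ in range(max_retries + 1):  # initial attempt plus the retries
        if elapsed >= eip_released_after:
            return elapsed
        elapsed += delay  # simulated wait before the next attempt
    return None
```

In this model, an EIP release that takes 2 minutes is absorbed by the retries, while one that takes 400 seconds exhausts the 300-second budget and the destroy fails. With the 90 checks mentioned later in this thread, the budget grows to 900 seconds (15 minutes).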

s-urbaniak pushed a commit to s-urbaniak/tectonic-installer that referenced this issue Jun 12, 2017
This prevents the internet gateway to be deleted before Elastic IPs
which can cause destroy failure.

Fixes coreos#1017
@s-urbaniak
Contributor

I verified that adding an explicit dependency from aws_eip.nat_eip to aws_internet_gateway.igw at least ensures that aws_eip.nat_eip is deleted before aws_internet_gateway.igw. The retry then becomes an effective countermeasure that covers the time until the deletion propagates inside the AWS control plane.
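
Why the extra edge fixes the ordering can be shown with a toy dependency graph. This is a sketch using Python's standard `graphlib` module (Python 3.9+), not Terraform's graph walker: it linearizes the order, whereas Terraform processes independent nodes concurrently, and `destroy_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def destroy_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid destroy order for the given dependency graph.

    depends_on maps each resource to the resources it depends on. Resources
    are created dependencies-first, so they are destroyed in reverse.
    """
    create_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(create_order))


# With the explicit dependency, the EIP depends on the internet gateway,
# so every valid destroy order removes the EIP before the gateway.
with_dependency = {
    "aws_route.igw_route": {"aws_internet_gateway.igw"},
    "aws_eip.nat_eip": {"aws_internet_gateway.igw"},
}

order = destroy_order(with_dependency)
print(order)  # the gateway always comes last
```

Without the `aws_eip.nat_eip` entry pointing at the gateway, nothing constrains the relative order of the two resources, which matches the concurrent deletion described earlier.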

alexsomesan pushed a commit that referenced this issue Jun 12, 2017
…#1053)

* modules/aws: add dep from aws_eip.nat_eip to aws_internet_gateway.igw

This prevents the internet gateway to be deleted before Elastic IPs
which can cause destroy failure.

Fixes #1017

* modules/aws: remove explicit dep from nat_gw to nat_eip

It is expressed in allocation_id already and can cause the graph
optimization to be distorted.
@Quentin-M
Contributor Author

The issue happened again in a few PRs, including https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1074/2/pipeline. Hence, re-opening :(

@s-urbaniak
Contributor

This is very unfortunate :-( With a patched Terraform build that has max-retries set to 2, I could reproduce that we may still get 400 responses from AWS even when the internet gateway is destroyed after the EIPs:

2017/06/14 09:20:10 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/14 09:20:10 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/14 09:20:10 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/14 09:20:10 [DEBUG] plugin: terraform: HTTP/1.1 400 Bad Request
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Connection: close
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Transfer-Encoding: chunked
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Date: Wed, 14 Jun 2017 07:23:16 GMT
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Server: AmazonEC2
2017/06/14 09:20:10 [DEBUG] plugin: terraform: 
2017/06/14 09:20:10 [DEBUG] plugin: terraform: 
2017/06/14 09:20:10 [DEBUG] plugin: terraform: -----------------------------------------------------
2017/06/14 09:20:10 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/14 09:20:10 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2017/06/14 09:20:10 [DEBUG] plugin: terraform: <Response><Errors><Error><Code>DependencyViolation</Code><Message>Network vpc-2bd06b42 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.</Message></Error></Errors><RequestID>de0b963a-e6aa-4306-a9ca-fe6e662bf75f</RequestID></Response>

The only countermeasure that comes to mind now is to make the internet_gateway resource aware of https://www.terraform.io/docs/configuration/resources.html#timeouts, which implies an upstream change.

@Quentin-M
Contributor Author

Closing here as we have a few retries now. Should track resolution in #1054. SPC engineers are working on implementing the timeout lifecycle on the resource.

s-urbaniak pushed a commit to s-urbaniak/terraform that referenced this issue Jun 29, 2017
This increases the not found checks to 90, which, delayed by 10 seconds,
match the total timeout of 15 minutes.

Fixes coreos/tectonic-installer#1017
@s-urbaniak
Contributor

upstream PR: hashicorp/terraform-provider-aws#1021

@mxinden
Contributor

mxinden commented Jul 20, 2017

@s-urbaniak Can we close here as hashicorp/terraform-provider-aws#1021 was merged?

@s-urbaniak
Contributor

@mxinden definitely, thanks for the catch!

s-urbaniak pushed a commit to s-urbaniak/terraform-provider-aws that referenced this issue Sep 6, 2017
This increases the not found checks to 90, which, delayed by 10 seconds,
match the total timeout of 15 minutes.

Fixes coreos/tectonic-installer#1017