Terraform destroy: retry policy #1017

Quentin-M · 2017-06-07T19:47:12Z

We have been running more and more into:

Error applying plan:

1 error(s) occurred:

* module.vpc.aws_internet_gateway.igw (destroy): 1 error(s) occurred:

* aws_internet_gateway.igw: Error waiting for internet gateway (igw-648d5203) to detach: couldn't find resource (31 retries)

We currently destroy once as part of the smoke test.
If this fails (like here), we mark the test as failed and try to destroy up to three more times (see the Jenkinsfile).

While the real fix belongs in the code base or upstream, would you like to reconsider the policy a bit?

My suggestion would be to:

Still try to destroy once,
Still fail the test otherwise
Destroy another time normally
Destroy with -refresh=false
Destroy with -refresh=true and TF_LOG=TRACE
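
For illustration, the proposed sequence could be sketched as below. This is only a sketch, not the Jenkinsfile logic: the names `destroy_attempts`, `run_policy` and `subprocess_runner` are hypothetical, and any confirmation or variable flags the real pipeline passes to `terraform destroy` are left out.

```python
import os
import subprocess
from typing import Callable, Dict, List, Tuple

# One attempt = (extra environment variables, command line).
Attempt = Tuple[Dict[str, str], List[str]]


def destroy_attempts() -> List[Attempt]:
    """Ordered attempts for the proposed policy (hypothetical helper)."""
    base = ["terraform", "destroy"]
    return [
        ({}, base),                                       # first destroy, decides the test result
        ({}, base),                                       # destroy another time normally
        ({}, base + ["-refresh=false"]),                  # skip the state refresh
        ({"TF_LOG": "TRACE"}, base + ["-refresh=true"]),  # last try, with trace logging
    ]


def run_policy(run: Callable[[Dict[str, str], List[str]], int]) -> Tuple[bool, bool]:
    """Run attempts in order until one exits with 0.

    Returns (test_passed, cleaned_up). The test only passes when the
    very first destroy succeeds; later attempts are cleanup only.
    """
    for index, (env, argv) in enumerate(destroy_attempts()):
        if run(env, argv) == 0:
            return index == 0, True
    return False, False


def subprocess_runner(env: Dict[str, str], argv: List[str]) -> int:
    """Run the command for real, with the extra variables added to the environment."""
    return subprocess.call(argv, env={**os.environ, **env})
```

Calling `run_policy(subprocess_runner)` would run the real commands. Passing the runner in as a parameter keeps the ordering testable without touching any infrastructure.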

sym3tri · 2017-06-09T23:39:44Z

This is high priority (P0). We need to drop everything and fix this ASAP.

s-urbaniak · 2017-06-12T12:54:01Z

I investigated the problem today and here is what happens. The destruction fails at detaching the internet gateway modules.aws.vpc.aws_internet_gateway.igw.

The only object that depends on it is module.aws.vpc.aws_route.igw_route, so Terraform destroys the route first and the internet gateway afterwards, which is correct from a DAG traversal perspective.

Unfortunately, the error response from AWS indicates that there are still public IP addresses mapped in the VPC of that gateway. AWS responds with an HTTP 400 status code and the following error message:

2017/06/12 14:28:39 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:28:39 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DetachInternetGateway Details:
...
2017/06/12 14:28:39 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/12 14:28:39 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/12 14:28:39 [DEBUG] plugin: terraform: HTTP/1.1 400 Bad Request
2017/06/12 14:28:39 [DEBUG] plugin: terraform: <Response><Errors><Error><Code>DependencyViolation</Code><Message>Network vpc-bbe35bd2 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.</Message></Error></Errors><RequestID>b4a0b4bc-d130-4889-8680-a560bce90be3</RequestID></Response>

According to the error message, AWS internally also ties the gateway to the Elastic IPs modules.aws.vpc.aws_eip.nat_eip. Terraform, however, has no DAG-based relationship between them, so it deletes those resources concurrently.

Looking at the deletion timestamps for the aws_eip.nat_eip resources, one can see that detaching the internet gateway aws_internet_gateway.igw succeeds immediately after they are destroyed:

...
module.vpc.aws_eip.nat_eip.0: Destruction complete
2017/06/12 14:30:45 [TRACE] Preserving existing state lineage "0c9f710e-b766-4983-a38a-0b7c8258dd0c"
...
module.vpc.aws_eip.nat_eip.1: Destruction complete
2017/06/12 14:30:45 [DEBUG] root.vpc: eval: *terraform.EvalUpdateStateHook
...
2017/06/12 14:30:45 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:30:45 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DetachInternetGateway Details:
...
14:30:45 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/12 14:30:45 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/12 14:30:45 [DEBUG] plugin: terraform: HTTP/1.1 200 OK
2017/06/12 14:30:45 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/12 14:30:45 [DEBUG] Waiting for state to become: [success]

So why does detaching aws_internet_gateway.igw eventually work most of the time and fail only sometimes?

Since there is no direct relationship towards aws_eip.nat_eip, Terraform retries detaching aws_internet_gateway.igw up to 30 times, with a 10-second delay between retries (a total timeout of 5 minutes), according to https://github.com/hashicorp/terraform/blob/v0.9.6/builtin/providers/aws/resource_aws_internet_gateway.go#L243-L244.

Eventually aws_eip.nat_eip gets destroyed, which lets aws_internet_gateway.igw detach successfully. But if destroying aws_eip.nat_eip takes longer than 5 minutes from the moment Terraform starts destroying aws_internet_gateway.igw, the whole destroy process fails with the above error message.
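
To make the arithmetic concrete, here is a small simulation of that race. It is a simplified model, not the provider's code: it assumes one initial detach attempt at t=0 followed by `max_retries` retries spaced `delay_s` apart, and that a detach succeeds as soon as the Elastic IPs are gone. The function names are made up for this example.

```python
def retry_budget_seconds(max_retries: int, delay_s: int) -> int:
    """Total time covered by the retries, e.g. 30 * 10 s = 300 s (5 minutes)."""
    return max_retries * delay_s


def detach_succeeds(eip_release_s: int, max_retries: int = 30, delay_s: int = 10) -> bool:
    """Return True if some detach attempt happens at or after the EIP release.

    eip_release_s is the number of seconds after the first detach attempt
    at which the Elastic IPs are fully released (assumed, for the model).
    """
    for attempt in range(max_retries + 1):  # initial attempt plus retries
        if attempt * delay_s >= eip_release_s:
            return True
    return False


if __name__ == "__main__":
    print(retry_budget_seconds(30, 10))            # budget with 30 retries
    print(detach_succeeds(290))                    # EIPs released inside the budget
    print(detach_succeeds(310))                    # EIPs released too late
    print(detach_succeeds(310, max_retries=90))    # same delay, larger retry count
```

Under this model, a release after 290 seconds is still caught by the 30-retry budget, while a release after 310 seconds is not, unless the retry count is raised.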

s-urbaniak · 2017-06-12T13:42:07Z

I verified that putting an explicit dependency from aws_eip.nat_eip towards aws_internet_gateway.igw at least ensures that aws_eip.nat_eip is deleted before aws_internet_gateway.igw. That makes the retry an effective countermeasure while the deletion propagates inside the AWS control plane.
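
The ordering guarantee can be illustrated with a tiny dependency check. This is a sketch of the general rule (on destroy, a resource is removed before the things it depends on), not Terraform's graph code; `destroyed_before` is a hypothetical helper, and the two graphs only contain the resources named in this thread.

```python
from typing import Dict, Set


def destroyed_before(depends_on: Dict[str, Set[str]], first: str, second: str) -> bool:
    """True if `first` (transitively) depends on `second`.

    In that case a destroy has to finish `first` before it may start
    on `second`. Without such a path the two can run concurrently.
    """
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, ()):
            if dep == second:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


# Before the change: only the route depends on the internet gateway.
before = {
    "aws_route.igw_route": {"aws_internet_gateway.igw"},
    "aws_eip.nat_eip": set(),
    "aws_internet_gateway.igw": set(),
}

# After the change: the Elastic IPs explicitly depend on the gateway too.
after = {
    "aws_route.igw_route": {"aws_internet_gateway.igw"},
    "aws_eip.nat_eip": {"aws_internet_gateway.igw"},
    "aws_internet_gateway.igw": set(),
}

if __name__ == "__main__":
    print(destroyed_before(before, "aws_eip.nat_eip", "aws_internet_gateway.igw"))
    print(destroyed_before(after, "aws_eip.nat_eip", "aws_internet_gateway.igw"))
```

With the `before` graph there is no path from the EIPs to the gateway, so nothing orders the two deletions. With the `after` graph the EIPs are always destroyed first.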

…#1053)

* modules/aws: add dep from aws_eip.nat_eip to aws_internet_gateway.igw

  This prevents the internet gateway to be deleted before Elastic IPs which can cause destroy failure. Fixes #1017

* modules/aws: remove explicit dep from nat_gw to nat_eip

  It is expressed in allocation_id already and can cause the graph optimization to be distorted.

Quentin-M · 2017-06-13T20:14:35Z

The issue happened again in a few PRs, including https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1074/2/pipeline. Hence, re-opening :(

Jenkinsfile: retry destroy on AWS Smoke until #1017/#1054 are fixed

s-urbaniak · 2017-06-14T08:08:09Z

This is very unfortunate :-( I could reproduce (with a patched Terraform version that has max-retries set to 2) that we might still get 400 responses from AWS even if the internet gateway is destroyed after the EIPs:

2017/06/14 09:20:10 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/14 09:20:10 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachInternetGateway Details:
2017/06/14 09:20:10 [DEBUG] plugin: terraform: ---[ RESPONSE ]--------------------------------------
2017/06/14 09:20:10 [DEBUG] plugin: terraform: HTTP/1.1 400 Bad Request
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Connection: close
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Transfer-Encoding: chunked
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Date: Wed, 14 Jun 2017 07:23:16 GMT
2017/06/14 09:20:10 [DEBUG] plugin: terraform: Server: AmazonEC2
2017/06/14 09:20:10 [DEBUG] plugin: terraform: 
2017/06/14 09:20:10 [DEBUG] plugin: terraform: 
2017/06/14 09:20:10 [DEBUG] plugin: terraform: -----------------------------------------------------
2017/06/14 09:20:10 [DEBUG] plugin: terraform: aws-provider (internal) 2017/06/14 09:20:10 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2017/06/14 09:20:10 [DEBUG] plugin: terraform: <Response><Errors><Error><Code>DependencyViolation</Code><Message>Network vpc-2bd06b42 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.</Message></Error></Errors><RequestID>de0b963a-e6aa-4306-a9ca-fe6e662bf75f</RequestID></Response>

The only countermeasure that comes to mind now is to make the internet_gateway resource aware of https://www.terraform.io/docs/configuration/resources.html#timeouts, which implies an upstream change.
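
As a rough idea of what a timeout-aware detach could look like, here is a sketch of a deadline-based retry that only retries on DependencyViolation. It is not the provider's implementation (which is written in Go): `Ec2Error` and `detach_with_timeout` are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Ec2Error(Exception):
    """Hypothetical stand-in for an EC2 API error carrying an error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def detach_with_timeout(
    detach: Callable[[], T],
    timeout_s: float,
    delay_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call detach() until it succeeds or the timeout would be exceeded.

    Only DependencyViolation is treated as retryable; any other error
    is raised immediately.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return detach()
        except Ec2Error as err:
            if err.code != "DependencyViolation":
                raise
            if clock() + delay_s > deadline:
                raise TimeoutError("gateway still has dependencies") from err
        sleep(delay_s)
```

The difference from a fixed retry count is that the caller chooses the total wait (for example 15 minutes) directly, which is what a configurable timeout on the resource would expose.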

Quentin-M · 2017-06-14T20:46:49Z

Closing here as we have a few retries now. Should track resolution in #1054. SPC engineers are working on implementing the timeout lifecycle on the resource.

This increases the not found checks to 90, which, delayed by 10 seconds, match the total timeout of 15 minutes. Fixes coreos/tectonic-installer#1017

s-urbaniak · 2017-06-30T14:03:51Z

upstream PR: hashicorp/terraform-provider-aws#1021

mxinden · 2017-07-20T15:42:37Z

@s-urbaniak Can we close here as hashicorp/terraform-provider-aws#1021 was merged?

s-urbaniak · 2017-07-20T15:45:17Z

@mxinden definitely, thanks for the catch!

Quentin-M assigned ggreer, sym3tri, s-urbaniak, kans and alexsomesan Jun 7, 2017

ggreer removed their assignment Jun 7, 2017

Quentin-M mentioned this issue Jun 7, 2017

openstack: make instance names match hostnames #1015

Merged

sym3tri added priority/P0 area/testing labels Jun 9, 2017

sym3tri added this to the Sprint 3: Stability & Test Automation milestone Jun 9, 2017

s-urbaniak mentioned this issue Jun 12, 2017

modules/aws: add dep from aws_eip.nat_eip to aws_internet_gateway.igw #1053

Merged

alexsomesan closed this as completed in #1053 Jun 12, 2017

Quentin-M reopened this Jun 13, 2017

This was referenced Jun 13, 2017

CI failure umbrella issue #1054

Closed

smoke tests should be easily runnable outside of jenkins #1038

Closed

Jenkinsfile: retry destroy on AWS Smoke until #1017/#1054 are fixed #1077

Merged

Quentin-M added a commit that referenced this issue Jun 13, 2017

Merge pull request #1077 from Quentin-M/retry_destroy

5c58588

Jenkinsfile: retry destroy on AWS Smoke until #1017/#1054 are fixed

s-urbaniak added the terraform/upstream label Jun 14, 2017

Quentin-M closed this as completed Jun 14, 2017

s-urbaniak mentioned this issue Jun 23, 2017

Retry timeout parameter for aws_internet_gateway hashicorp/terraform-provider-aws#945

Closed

radeksimko mentioned this issue Jun 30, 2017

r/internet_gateway: Retry properly on DependencyViolation hashicorp/terraform-provider-aws#1021

Merged

s-urbaniak reopened this Jun 30, 2017

s-urbaniak closed this as completed Jul 20, 2017

s-urbaniak added the kind/flake label Jul 21, 2017
