Firewall rules not getting deleted on terraform destroy in GKE #5948

Closed
MeghanaSrinath opened this issue Mar 23, 2020 · 12 comments

@MeghanaSrinath

We have used terraform to set up a private GKE cluster along with the VPC and required firewall rules.
However, when we try to delete all the resources with terraform destroy, the VPC is not deleted and we get the error below:

Error: Error applying plan:

1 error(s) occurred:

* module.vpc.google_compute_network.vpc (destroy): 1 error(s) occurred:

* google_compute_network.vpc: Error waiting for Deleting Network: The network resource 'projects/<project-name>/global/networks/vpc-ea5271e8141d4054a0e1abc7787f6f20' is already being used by 'projects/<project-name>/global/firewalls/k8s-e4198c6a84c32393-node-http-hc'

When I view the firewall rules in the GCP console, I can see that there are 2 firewall rules created by GKE to allow the master node to communicate with the worker nodes.
This is also stated in the link here.

Because of these firewall rules, the VPC created by terraform cannot be destroyed. Even after deleting the cluster through terraform, these firewall rules remain.
Can someone let us know how all the firewall rules can be destroyed by terraform in order to overcome this issue?
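
To see which rules still reference the VPC before retrying the destroy, something like the sketch below may help. It only prints a `gcloud` command for review rather than running it. The function name and the project and network arguments are placeholders that do not come from this thread, and the regex filter assumes the rule's `network` field ends in `/<network-name>`.

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) a gcloud command that lists every firewall
# rule whose network URL ends in the given VPC name.
# list_vpc_firewalls_cmd is a hypothetical helper name.
list_vpc_firewalls_cmd() {
  local project="$1" network="$2"
  # "~" is gcloud's regex-match filter operator; "$" anchors the VPC name
  # at the end of the network URL.
  printf "gcloud compute firewall-rules list --project='%s' --filter='network~/%s\$' --format='table(name,description)'\n" \
    "$project" "$network"
}
```

Calling `list_vpc_firewalls_cmd <project> <vpc-name>` prints the command; running that command should show any `k8s-...` rules that still block deletion of the network.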

@edwardmedia edwardmedia self-assigned this Mar 23, 2020
@edwardmedia
Contributor

edwardmedia commented Mar 23, 2020

@MeghanaSrinath Can you post the code and the steps that led to that error? I'd like to repro it on my side.

@MeghanaSrinath
Author

Hi @edwardmedia

Here are the cluster and network related modules that we have in place:

resource "google_compute_network" "vpc" {
  name                    = "vpc-${var.client}"
  auto_create_subnetworks = "false"
}
--------------------------------------
resource "google_compute_subnetwork" "subnet" {
  name          = "subnet-${var.client}"
  ip_cidr_range = "${var.cidr}"
  region        = "${var.region}"
  network       = "${var.network}"
}
----------------------------------------
resource "google_compute_firewall" "firewall" {
  name    = "firewall-${var.client}"
  network = "${var.network}"

  allow {
    protocol = "tcp"
    ports    = ["80", "8080", "1000-2000", "22"]
  }

  source_ranges = ["<ip range>"]
  target_tags   = ["bastion-${var.client}"]
}

----------------------------------------
resource "google_container_cluster" "cluster" {
  name                     = "cluster-${var.client}"
  location                 = "${var.location}"
  remove_default_node_pool = true
  initial_node_count       = "${var.min_node_count}"
  network                  = "${var.network}"
  subnetwork               = "${var.subnetwork}"

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  ip_allocation_policy {
    use_ip_aliases = true
  }

  private_cluster_config {
    enable_private_endpoint = true
    enable_private_nodes    = true
    master_ipv4_cidr_block  = "192.168.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "<ip range>"
      display_name = "my-ip"
    }
  }
}


resource "google_container_node_pool" "nodepool" {
  name               = "np-${var.client}"
  location           = "${var.location}"
  cluster            = "${google_container_cluster.cluster.name}"
  initial_node_count = "${var.min_node_count}"

  autoscaling {
    min_node_count = "${var.min_node_count}"
    max_node_count = "${var.max_node_count}"
  }

  node_config {
    machine_type    = "${var.node_type}"
    service_account = "${var.email}"
    oauth_scopes    = ["cloud-platform", "userinfo-email"]
    image_type      = "ubuntu"
  }
}
-----------------------------------------

We do have other modules for the bastion VM, NAT and router along with the above modules.
We are using the terraform destroy command to delete each of these resources, module by module.
We destroy the bastion VM first, then the cluster, NAT, router, firewall, subnet and the VPC, in this order.
As such, the VPC module is the last one to be destroyed, and the terraform destroy of this module fails with the error below:

Error: Error applying plan:

1 error(s) occurred:

* module.vpc.google_compute_network.vpc (destroy): 1 error(s) occurred:

* google_compute_network.vpc: Error waiting for Deleting Network: The network resource 'projects/<project>/global/networks/vpc-ea5271e8141d4054a0e1abc7787f6f20' is already being used by 'projects/<project>/global/firewalls/k8s-e4198c6a84c32393-node-http-hc'

Looking at the GCP console, we can see that 2 firewall rules have not been deleted.
One of the firewall rules, k8s-e4198c6a84c32393-node-http-hc, is for port 10256 and has the description {"kubernetes.io/cluster-id":"e4198c6a84c32393"}

The other firewall rule, k8s-fw-a4b5d3a346c5d11eab3bd4201c0a8000, is due to a service exposed through a load balancer in the cluster and has the description
{"kubernetes.io/service-name":"ethan/nginx", "kubernetes.io/service-ip":"35.244.36.34"}

But there are other firewall rules, created because of our application running in the cluster, which are successfully deleted on terraform destroy.
Is there any reason why these two particular firewall rules aren't destroyed by terraform?
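
One possible explanation, offered as background rather than something established at this point in the thread: GKE names the rules it creates for the cluster itself `gke-<cluster>-...` and removes them with the cluster, while rules named `k8s-...` are created by controllers inside the cluster for Services and Ingresses and are normally cleaned up when those objects are deleted. The sketch below sorts rule names by that prefix. The function name is hypothetical, and the prefix convention is an assumption worth checking against the current GKE documentation.

```shell
#!/usr/bin/env bash
# Sketch: guess who owns an auto-created firewall rule from its name prefix.
# classify_firewall_rule is a hypothetical helper; the prefixes are assumptions.
classify_firewall_rule() {
  case "$1" in
    k8s-*) echo "service-or-ingress" ;;  # created for a Service/Ingress in the cluster
    gke-*) echo "cluster" ;;             # created for the cluster itself
    *)     echo "other" ;;               # e.g. rules defined in Terraform
  esac
}
```

Both rules named in the error above fall into the `k8s-` group, which would be consistent with them outliving the cluster.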

@edwardmedia
Contributor

@MeghanaSrinath I can't repro the issue. I'm not sure what you did after the GKE cluster was created that may have affected the behavior. Can you try to run tf destroy right after apply is complete? If you can repro the issue, can you post full debug logs for both apply and destroy?

@MeghanaSrinath
Author

MeghanaSrinath commented Apr 8, 2020

@edwardmedia, we ran tf destroy soon after tf apply and we still had the issue.
These leftover firewall rules are due to the load balancers that get created when we deploy the microservices application in our GKE cluster. Each of these microservices needs an LB to function. What we don't understand is why these resources are not deleted after the cluster is destroyed.

@ghost ghost removed the waiting-response label Apr 8, 2020
@edwardmedia
Contributor

@MeghanaSrinath Between tf apply and the tf destroy that followed soon after, did you deploy the microservice applications to the cluster? You mentioned they need a LB to function. What happened to the function? I don't have the full picture of all your infrastructure, so it is hard to guess. When I tried to repro, I was using your code above. Have you tried using the same code to see if you can repro the issue?

@MeghanaSrinath
Author

@edwardmedia, sorry for the delay in responding. Yes, we are creating an LB in our cluster. So this is our use case:
We create a cluster using the above modules and then we create a VM using terraform. Through this VM, we connect to the cluster. We have written our application logic in the remote_exec part of the VM. So basically the idea here is to connect to the cluster through a VM and remotely execute certain operations related to our app. During this operation, a load balancer gets created and we access our app through this LB. The problem occurs during tf destroy. There are a few firewall rules that are auto-created by GKE for this LB, and these are not deleted. Other firewall rules auto-created by GKE are deleted, but not the ones due to an LB. I hope I have clarified the details a bit. Please let me know in case more details are needed.

@ghost ghost removed the waiting-response label Apr 19, 2020
@edwardmedia
Contributor

edwardmedia commented Apr 20, 2020

@MeghanaSrinath Your code above works fine for me. I am not able to hit your error. You have mentioned other resources in this issue, and the issue is likely related to them. Can you provide the exact, detailed steps and code that I can follow in order to repro the issue? Also, please post the full apply and destroy debug logs.

@MeghanaSrinath
Author

Hi @edwardmedia
Unfortunately, I won't be able to provide the complete terraform apply logs, as they contain our application logic in the remote execution part of the bastion instance. However, I can list the resources that we create in terraform.

  1. Create VPC, subnetworks.
  2. Create a private cluster in the subnet.
  3. Create a Bastion instance.
    3a. Connect to the cluster in the bastion remote_exec.
    3b. Execute the commands in the remote_exec part: here is where we create a load balancer to access our app, and a firewall rule gets created for the LB.
  4. Terraform destroy bastion instance.
  5. Terraform destroy cluster.
  6. Terraform destroy subnet.
  7. Terraform destroy VPC: at this stage, destroy fails with the error mentioned.

What we have observed is that the firewall rules created in step 3b are not deleted when we destroy the cluster in step 5.
We wanted to know if this behaviour is as expected.
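
Since the leftover rules come from the load balancer created in step 3b, one option is an extra cleanup step between steps 3b and 5 that deletes the `LoadBalancer` Services while the cluster still exists, so that Kubernetes gets the chance to remove the firewall rules it created. A minimal sketch, assuming `kubectl` is already configured for the cluster (for example on the bastion). The `run` wrapper and the `DRY_RUN` variable are illustrative helpers, and `ethan/nginx` is the Service named in the firewall rule description earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch: delete a LoadBalancer Service before the cluster is destroyed.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Delete one Service given as "namespace/name" and wait until it is gone.
delete_lb_service() {
  local namespace="${1%%/*}" name="${1##*/}"
  run kubectl delete service "$name" --namespace "$namespace" --wait=true
}
```

`DRY_RUN=0 delete_lb_service ethan/nginx` would perform the deletion. The associated firewall rules may take a little while to disappear afterwards, so it is worth checking that they are gone before destroying the cluster.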

@ghost ghost removed the waiting-response label Apr 24, 2020
@edwardmedia
Contributor

edwardmedia commented Apr 24, 2020

@MeghanaSrinath Without seeing the complete code, I am not sure. But I see a problem here. The LB and firewall rules are created outside terraform. Do you import their state somewhere? If not, terraform has no knowledge of their existence and will not destroy them automatically. Because they likely depend on the resources you create in the steps before 3b, you need to delete these manually created resources before the next steps. Does this make sense?
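
If the rules are already orphaned (the cluster is gone but the rules remain), deleting them by hand before destroying the VPC could look roughly like this. It is a sketch only: the helper names and the `DRY_RUN` switch are illustrative, and the project ID and rule names must be supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch: delete leftover firewall rules by name.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: delete_firewall_rules <project> <rule-name>...
delete_firewall_rules() {
  local project="$1"
  shift
  local rule
  for rule in "$@"; do
    # --quiet skips gcloud's interactive confirmation prompt.
    run gcloud compute firewall-rules delete "$rule" --project "$project" --quiet
  done
}
```

Reviewing the dry-run output first, and only then rerunning with `DRY_RUN=0`, keeps an accidental deletion of Terraform-managed rules less likely.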

@MeghanaSrinath
Author

Hi @edwardmedia, I agree with this. Terraform doesn't have any knowledge of these resources, so we had better find a way to delete them before we run terraform destroy. You also mentioned importing the state. Can you please explain more about that? Is it possible to have a state file for resources that were not created by terraform?

@ghost ghost removed the waiting-response label Apr 25, 2020
@edwardmedia
Contributor

@MeghanaSrinath Sure. Yes, you can import state for resources that were not created by terraform. Most resources now support import. As an example, here is the command for google_compute_firewall: https://www.terraform.io/docs/providers/google/r/compute_firewall.html#import

You might also want to review terraform import in general. https://www.terraform.io/docs/import/usage.html
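
As a rough illustration of the import workflow: a matching `resource "google_compute_firewall"` block has to exist in the configuration first, and `terraform import` then binds the existing rule to that address. In the sketch below the helper name and the resource address `google_compute_firewall.lb_leftover` are assumptions made for illustration, and the project ID is left as a caller-supplied argument.

```shell
#!/usr/bin/env bash
# Sketch: print the terraform import command for an existing firewall rule.
# Assumes the configuration already contains a block such as
#   resource "google_compute_firewall" "lb_leftover" { ... }
# import_firewall_cmd is a hypothetical helper name.
import_firewall_cmd() {
  local address="$1" project="$2" rule="$3"
  # Import ID format: projects/<project>/global/firewalls/<name>
  printf "terraform import %s 'projects/%s/global/firewalls/%s'\n" \
    "$address" "$project" "$rule"
}
```

Note that an imported rule would still be managed by Kubernetes as well, so importing mainly makes sense for rules that nothing else will recreate or remove.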

For now, I am closing this issue. Feel free to reopen it if you still see an issue.

@ghost

ghost commented May 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators May 27, 2020