
CI/CD pipeline fails semi-randomly, but fails every time #3906

Closed

TonyWildish-BH opened this issue Apr 16, 2024 · 13 comments

Labels: bug, deployment

Comments

@TonyWildish-BH

Describe the bug
The CI/CD pipeline will not run to successful completion in my environment. I've triggered it manually more than 25 times, with a fresh configuration every time, and not one run has completed all the steps successfully.

Sometimes it fails during the actual deployment; other times it deploys but then fails during the E2E tests. The failure modes vary, but they are normally some variation on 'connection failed' or a timeout.

This is in our fork, with no code changes relative to the upstream repository, and nothing other than the config file changed between runs.

I'm sure this is not normal behaviour, but I have no idea how to address it. Everything happens between GitHub and Azure, with no on-prem resources involved. Even the gh CLI is launched from an Azure VM, so it's hard to see how network issues could be contributing to this.

I've attached a zip of the failed logs (failed.zip), in case that helps.

Steps to reproduce

  1. Populate a config.yaml file with unique values for the TRE ID, the mgmt resource group, the storage account, and so on.
  2. Run make auth to update the app roles.
  3. Update the secrets/environment variables in the GitHub CI/CD environment.
  4. Trigger the workflow through the gh CLI (see the sketch after this list).
  5. Wait for the workflow to end, then harvest the log file of the failed jobs.
  6. Go back to step 1.
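
For steps 4 and 5 I'm doing something along these lines. This is a minimal sketch only; the workflow file name (deploy_tre.yml) and the branch are assumptions, so substitute whatever your fork actually uses:

# Trigger the deployment workflow on the fork's main branch (workflow file name assumed).
gh workflow run deploy_tre.yml --ref main

# Grab the ID of the run just started, wait for it to finish, and harvest the failed-job logs.
run_id=$(gh run list --workflow deploy_tre.yml --limit 1 --json databaseId --jq '.[0].databaseId')
gh run watch "$run_id" --exit-status || gh run view "$run_id" --log-failed > "failed-$run_id.log"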

Azure TRE release version (e.g. v0.14.0 or main):
main, as of April 10th.

Deployed Azure TRE components - click the (i) in the UI:
n/a

TonyWildish-BH added the bug label Apr 16, 2024
@tim-allen-ck
Collaborator

Hi @TonyWildish-BH, let me look through the log files and get back to you.

@SvenAelterman
Collaborator

@TonyWildish-BH I wonder if you've ever tried just re-running the pipeline with the same values/secrets?

TRE is a complex deployment and sometimes things "happen" on the Azure side that cause it to fail.

@TonyWildish-BH
Author

Hi @SvenAelterman. Yes, I've done that a few times, and it didn't go through either. In fact, that's why I tried systematically banging away at it, to make sure I had a clean start every time.

As I mentioned, the errors are often the sort where a retry might help, but I don't expect a pipeline to fail this frequently with that kind of error, so I'm wondering what's going on.

@marrobi
Member

marrobi commented Apr 17, 2024

@TonyWildish-BH when you get to step 6, can I suggest you go back to step 4 rather than starting from the beginning each time, to see whether you get a consistent error that we can then troubleshoot?

As @SvenAelterman says, there are some things that will fail from time to time, but a rerun of the pipeline usually resolves them.

@TonyWildish-BH
Author

@marrobi, please see my previous comment: I've tried that a few times, and it's never gone all the way through. I'll try again, just for good measure.

However, even if it did work on a retry, that would be missing the point. A CI/CD pipeline that doesn't run reliably is broken and not fit for purpose. I'm trying to determine whether the failure is caused by something on our side, by Azure being fundamentally unreliable, or by something else entirely.

I don't really see how it can be on our side, since everything happens between GitHub and Azure, but I'm open to that possibility. However, both you and @SvenAelterman seem to be telling me that Azure is unreliable, which I hope is not the case.

@SvenAelterman
Collaborator

I don't mean to give that impression at all. The deployment of Azure TRE is complex, with a lot of dependencies and moving parts. Perhaps the TF or pipeline code could be improved to better handle those, etc.

It's extraordinary, I'm sure, to have so many consecutive failures (and in different places, no less). However, once the initial deployment is done, subsequent runs of the pipeline are much simpler and much less prone to issues.

Just curious, have you tried the manual deployment process?

PS: The automated end-to-end testing performed for pull requests relies on those same pipelines (IIRC), so they're exercised all the time.

@TonyWildish-BH
Author

Here's my first set of CI/CD retry attempts, and this is a hard fail I've seen before when running manually. The first pass hits an unexpected error creating the database locks; subsequent passes fail because the locks exist but have not been imported into the Terraform state:

Attempt #1:

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.mongo[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.tre_db[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #2:

azurerm_management_lock.tre_db[0]: Creating...
azurerm_management_lock.mongo[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #3:

azurerm_management_lock.mongo[0]: Creating...
azurerm_management_lock.tre_db[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1
Error: Process completed with exit code 2.

I'll clean up and try again to see what happens if I get past this point, which I often do.

@tim-allen-ck
Collaborator

I've seen the mongo lock error before. Try removing the lock in Azure and then rerunning the pipeline to let Terraform create the lock, or import the existing locks into the Terraform state as the error message suggests; a sketch of both options is below.
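
A rough sketch of both options, assuming the lock IDs shown in the logs above (the subscription, resource group, and account names are redacted there, so substitute your own values):

# Option 1: delete the orphaned locks in Azure so the next pipeline run can recreate them.
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmos-mongo-account>/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock"
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmos-account>/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock"

# Option 2: import the existing locks into the Terraform state instead. Run from the core
# Terraform directory, initialised against the same backend/state the pipeline uses.
terraform import 'azurerm_management_lock.mongo[0]' "<mongo-lock-id>"
terraform import 'azurerm_management_lock.tre_db[0]' "<tre-db-lock-id>"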

@tim-allen-ck
Collaborator

@TonyWildish-BH are you still having this issue?

@TonyWildish-BH
Author

Hi @tim-allen-ck. I've abandoned use of the pipeline with no successful resolution; these errors make it unusable for us. We'll have to find some other solution when we come to using the TRE in production.

@tim-allen-ck
Collaborator


Hi @TonyWildish-BH, we recommend using the deployment repo; this will avoid unnecessary errors with the E2E tests.

@TonyWildish-BH
Author

It's not just about the E2E tests; the hard fail above happens well before the TRE is fully deployed. If the deployment repo uses the same CI/CD pipeline, then that's not going to help.

@tim-allen-ck
Collaborator

Closing; re-open if the problem occurs again.
