
CI/CD pipeline fails semi-randomly, but fails every time #3906

Closed

TonyWildish-BH opened this issue Apr 16, 2024 · 13 comments

Labels: bug, deployment

Comments

@TonyWildish-BH

Describe the bug
The CI/CD pipeline will not run to successful completion in my environment. I've triggered it manually more than 25 times, with a fresh configuration every time, and not one run has completed all the steps successfully.

Sometimes it fails during the actual deployment; other times it deploys but then fails during the E2E tests. The failure modes vary, but they are normally some variation on 'connection failed' or a timeout.

This is in our fork, with no code changes relative to the upstream repository, and nothing other than the config file changed between runs.

I'm sure this is not normal behaviour, but I have no idea how to address it. Everything happens between GitHub and Azure, with no on-prem resources involved. Even the gh CLI is launched from an Azure VM, so it's hard to see how network issues could be contributing to this.

I've attached a zip of the failed logs (failed.zip), in case that helps.

Steps to reproduce

  1. Populate a config.yaml file with unique values for the TRE ID, the mgmt resource group, the storage account, and so on.
  2. Run make auth to update the app roles.
  3. Update the secrets/environment variables in the GitHub CI/CD environment.
  4. Trigger the workflow through the gh CLI (see the sketch after this list).
  5. Wait for the workflow to end, then harvest the log file of the failed jobs.
  6. Go back to step 1.
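
For steps 4 and 5 I'm doing something along these lines. This is a minimal sketch only; the workflow file name (deploy_tre.yml) and the branch are assumptions, so substitute whatever your fork actually uses:

# Trigger the deployment workflow on the fork's main branch (workflow file name assumed).
gh workflow run deploy_tre.yml --ref main

# Grab the ID of the run just started, wait for it to finish, and harvest the failed-job logs.
run_id=$(gh run list --workflow deploy_tre.yml --limit 1 --json databaseId --jq '.[0].databaseId')
gh run watch "$run_id" --exit-status || gh run view "$run_id" --log-failed > "failed-$run_id.log"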

Azure TRE release version (e.g. v0.14.0 or main):
main, as of April 10th.

Deployed Azure TRE components - click the (i) in the UI:
n/a

TonyWildish-BH added the bug label Apr 16, 2024
@tim-allen-ck
Collaborator

Hi @TonyWildish-BH, let me look through the log files and get back to you.

@SvenAelterman
Collaborator

@TonyWildish-BH I wonder if you've ever tried just re-running the pipeline with the same values/secrets?

TRE is a complex deployment and sometimes things "happen" on the Azure side that cause it to fail.

@TonyWildish-BH
Author

Hi @SvenAelterman. Yes, I've done that a few times, and it didn't go through either. In fact, that's why I tried systematically banging away at it, to make sure I had a clean start every time.

As I mentioned, the errors are often the sort where a retry might help, but I don't expect a pipeline to fail this frequently with that kind of error, so I'm wondering what's going on.

@marrobi
Member

marrobi commented Apr 17, 2024

@TonyWildish-BH when you get to step 6, can I suggest you go back to step 4 rather than starting from the beginning each time, to see whether you get a consistent error that we can then troubleshoot?

As @SvenAelterman says, there are some things that will fail from time to time, but a rerun of the pipeline usually resolves them.

@TonyWildish-BH
Author

@marrobi, please see my previous comment: I've tried that a few times, and it's never gone all the way through. I'll try again, just for good measure.

However, even if it did work on a retry, that would be missing the point. A CI/CD pipeline that doesn't run reliably is broken and not fit for purpose. I'm trying to determine whether the failure is caused by something on our side, by Azure being fundamentally unreliable, or by something else entirely.

I don't really see how it can be on our side, since everything happens between GitHub and Azure, but I'm open to that possibility. However, both you and @SvenAelterman seem to be telling me that Azure is unreliable, which I hope is not the case.

@SvenAelterman
Collaborator

I don't mean to give that impression at all. The deployment of Azure TRE is complex, with a lot of dependencies and moving parts. Perhaps the TF or pipeline code could be improved to better handle those, etc.

It's extraordinary, I'm sure, to have so many consecutive failures (and in different places, no less). However, once the initial deployment is done, subsequent runs of the pipeline are much simpler and much less prone to issues.

Just curious, have you tried the manual deployment process?

PS: The automated end-to-end testing performed for pull requests relies on those same pipelines (IIRC), so they're exercised all the time.

@TonyWildish-BH
Author

Here's my first set of CI/CD retry attempts, and this is a hard fail I've seen before when running manually. The first pass hits an unexpected error creating the database locks; subsequent passes fail because the locks exist but have not been imported into the Terraform state:

Attempt #1:

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.mongo[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.tre_db[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #2:

azurerm_management_lock.tre_db[0]: Creating...
azurerm_management_lock.mongo[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #3:

azurerm_management_lock.mongo[0]: Creating...
azurerm_management_lock.tre_db[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1
Error: Process completed with exit code 2.

I'll clean up and try again to see what happens if I get past this point, which I often do.

@tim-allen-ck
Collaborator

I've seen the mongo lock error before. Try removing the lock in Azure and then rerunning the pipeline to let Terraform create the lock, or import the existing locks into the Terraform state as the error message suggests; a sketch of both options is below.
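
A rough sketch of both options, assuming the lock IDs shown in the logs above (the subscription, resource group, and account names are redacted there, so substitute your own values):

# Option 1: delete the orphaned locks in Azure so the next pipeline run can recreate them.
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmos-mongo-account>/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock"
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmos-account>/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock"

# Option 2: import the existing locks into the Terraform state instead. Run from the core
# Terraform directory, initialised against the same backend/state the pipeline uses.
terraform import 'azurerm_management_lock.mongo[0]' "<mongo-lock-id>"
terraform import 'azurerm_management_lock.tre_db[0]' "<tre-db-lock-id>"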

@tim-allen-ck
Collaborator

@TonyWildish-BH are you still having this issue?

@TonyWildish-BH
Author

Hi @tim-allen-ck. I've abandoned use of the pipeline with no successful resolution; these errors make it unusable for us. We'll have to find some other solution when we come to using the TRE in production.

@tim-allen-ck
Collaborator


Hi @TonyWildish-BH, we recommend using the deployment repo; this will avoid unnecessary errors with the E2E tests.

@TonyWildish-BH
Author

It's not just about the E2E tests; the hard fail above happens well before the TRE is fully deployed. If the deployment repo uses the same CI/CD pipeline, then that's not going to help.

@tim-allen-ck
Collaborator

Closing; re-open if the problem occurs again.
