CI/CD pipeline fails semi-randomly, but fails every time #3906
Comments
Hi @TonyWildish-BH, let me look through the log files and get back to you.
@TonyWildish-BH I wonder if you've ever tried just re-running the pipeline with the same values/secrets? TRE is a complex deployment and sometimes things "happen" on the Azure side that cause it to fail.
Hi @SvenAelterman. Yes, I've done that a few times, and that didn't go through either. In fact, that's why I tried systematically banging away at it, to make sure I had a clean start every time. As I mention, the errors are quite often of the sort where a retry might help, but I don't expect a pipeline to fail so frequently with that sort of error, so I'm wondering what's going on.
@TonyWildish-BH when you get to step 6, can I suggest you go back to step 4, to see if you get a consistent error that we can then troubleshoot? Do not start at the beginning again each time. As @SvenAelterman says, there are some things that will fail from time to time, but a rerun of the pipeline usually resolves.
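For reference, re-running only the failed jobs of a pipeline run (rather than starting from scratch) can be done from the `gh` CLI; the workflow file name below is an assumption, adjust to match the repository:

```shell
# List recent runs of the deployment workflow (workflow file name is an assumption)
gh run list --workflow=deploy_tre.yml --limit 5

# Re-run only the failed jobs of a given run ID, keeping the same inputs/secrets
gh run rerun 1234567890 --failed
```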
@marrobi, please see my previous comment, I've tried that a few times, and it's never gone all the way through. I'll try again, just for good measure. However, even if it did work on a retry, that's missing the point. A CI/CD pipeline that doesn't run reliably is broken, and not fit for purpose. I'm trying to determine if the failure is because of something on our side, or if it's because Azure is fundamentally unreliable, or whatever else could be the cause. I don't really see how it can be on our side, since everything's happening between GitHub and Azure, but I'm open to that possibility. However, both you and @SvenAelterman seem to be telling me that Azure is unreliable, which I hope is not the case.
I don't mean to give that impression at all. The deployment of Azure TRE is complex with a lot of dependencies and moving parts. Perhaps the TF or pipeline code could be improved to better handle those. It's extraordinary, I am sure, to have so many consecutive failures (and in different places, no less). However, once the initial deployment is done, subsequent runs of the pipeline are much simpler and much less prone to experiencing issues. Just curious, have you tried the manual deployment process? PS: The automated, end-to-end testing performed for pull requests relies on those same pipelines (IIRC), so it's used all the time.
Here's my first set of CI/CD retry attempts, and this is a hard fail I've seen before, running manually. The first pass has an unexpected error creating the database locks, subsequent passes fail because the locks are there, but not imported to Terraform: Attempt #1:
Attempt #2:
Attempt #3:
I'll clean up and try again, see what happens if I can get past this, which I often do. |
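The "lock exists but is not in Terraform state" failure described above can, in principle, be cleared by importing the existing lock into state instead of deleting it; a hedged sketch, where the resource address, subscription ID, resource group, and lock name are all hypothetical placeholders:

```shell
# Import the already-created Azure management lock into Terraform state,
# so the next plan/apply sees it instead of failing on "resource exists".
# All names and IDs below are placeholders -- adjust to the actual TF config.
terraform import azurerm_management_lock.tre_db_lock \
  "/subscriptions/<sub-id>/resourceGroups/<tre-rg>/providers/Microsoft.Authorization/locks/<lock-name>"
```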
I've seen the mongo lock error before. Try removing the lock from Azure and then rerunning the pipeline to let TF create the lock.
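Removing the stale lock so the pipeline can recreate it might look like this with the Azure CLI; resource group and lock names are placeholders:

```shell
# Find the lock left behind by the failed run (names are placeholders)
az lock list --resource-group <tre-rg> --output table

# Delete it so the next pipeline run can recreate it via Terraform
az lock delete --name <lock-name> --resource-group <tre-rg>
```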
@TonyWildish-BH are you still having this issue? |
Hi @tim-allen-ck. I've abandoned use of the pipeline with no successful resolution; these errors make it unusable for us. We'll have to find some other solution when we come to using the TRE in production.
Hi @TonyWildish-BH, we recommend using the deployment repo; this will avoid unnecessary errors with the E2E tests.
It's not just about the E2E tests; the hard fail above happens well before the TRE is fully deployed. If the deployment repo uses the same CI/CD pipeline, then that's not going to help.
Closing; re-open if the problem occurs again. |
Describe the bug
The CI/CD pipeline will not run to successful completion in my environment. I've triggered it manually, >25 times, with a fresh configuration every time, and not one run has completed all the steps successfully.
Sometimes it fails during the actual deployment, other times it deploys, but fails during the E2E tests. Failure modes vary, but are normally some variation on 'connection failed' or timeout.
This is in our fork with no change in the code w.r.t. the upstream repository, and no change of anything other than the config file between runs.
I'm sure this is not normal behaviour, but I have no idea how to address it. Everything is happening between GitHub and Azure, with no on-prem resources. Even the `gh` CLI is being launched from an Azure VM, so it's hard to see how there can be any network issues contributing to this. I've attached a zip of the failed logs (failed.zip), in case that helps.
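Each run was triggered manually from the `gh` CLI; a minimal sketch of that trigger, assuming the workflow file is named `deploy_tre.yml` (the actual name may differ):

```shell
# Kick off the deployment workflow on main (workflow file name is an assumption)
gh workflow run deploy_tre.yml --ref main

# Follow the run's progress in the terminal
gh run watch
```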
Steps to reproduce
1. Create a fresh `config.yaml` file with unique values for the TRE id, the mgmt group, storage account and so on.
2. Run `make auth` to update with new app roles.
3. Trigger the pipeline from the `gh` CLI.

Azure TRE release version (e.g. v0.14.0 or main):
main, as of April 10th.
Deployed Azure TRE components - click the (i) in the UI:
n/a