-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Docs] Document CI System #14656
[Docs] Document CI System #14656
Changes from 3 commits
c350cc6
ab1958b
18c9689
045f884
c1862b1
2922717
cb494c9
3561795
692ae3e
bddda15
58fafe4
4cbaa0b
85e4a29
61a4a05
dcd1b17
8719841
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,114 @@ | ||||||||||
# The CI Service in Hail Batch | ||||||||||
|
||||||||||
Hail Batch includes a CI service which has three jobs: | ||||||||||
|
||||||||||
- Runs tests against pull requests | ||||||||||
- Merges PRs to the `main` branch | ||||||||||
- Deploys services to the live infrastructure | ||||||||||
|
||||||||||
It does all of these things only if certain conditions are met. | ||||||||||
|
||||||||||
## Deployment | ||||||||||
|
||||||||||
The CI service itself is deployed as a Kubernetes service in the Hail Batch cluster. See | ||||||||||
[architecture diagram](../Hail%20Batch%20Architectural%20Diagram.png). | ||||||||||
|
||||||||||
As part of its configuration, CI must be configured with an access token allowing it to operate on | ||||||||||
behalf of a github account called hail-ci-robot. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. repetition of "configure"
Suggested change
|
||||||||||
|
||||||||||
To make CI work it needs to receive update triggers from the github repository. These are | ||||||||||
configured within the github repository itself. | ||||||||||
|
||||||||||
## Running tests against pull requests | ||||||||||
|
||||||||||
The CI service is constantly monitoring for new PRs in the `hail` repository. | ||||||||||
|
||||||||||
When a new PR is opened, or an existing PR changes, hooks are triggered which alert the CI service. | ||||||||||
The CI service will then: | ||||||||||
|
||||||||||
- Check for validity | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reworded |
||||||||||
- The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI) | ||||||||||
- The PR must be against the `main` branch | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this extra context is super helpful. IMO it's worth including both "how we have it set up" and "what can be configured" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This also makes it sound like CI is more poll-based than trigger-based? I don't have enough permissions in the repo to check what hooks we have set up so some of this is speculative (I should probably have done / go ahead and do a better job of highlighting guesswork) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In fact, the advice from Daniel that "deleting CI's most recent deploy run will make it run again" seems like it must be polling based. I'll look into this today There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a bit of both. Updated the doc with that info |
||||||||||
- Run tests | ||||||||||
- The CI service will run tests against the PR | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 I'll add this under the details of what the test batch does below. |
||||||||||
- The CI service will report the results of the tests back to GitHub | ||||||||||
|
||||||||||
### Running Tests | ||||||||||
|
||||||||||
The process of running tests goes like: | ||||||||||
|
||||||||||
- CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 I've mentioned this now |
||||||||||
- The batch contains jobs which will: | ||||||||||
- Clone the appropriate branch and build a new set of docker images for the updated services. | ||||||||||
- Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster | ||||||||||
- These namespaces are called | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gotta leave a couple of cliffhangers in here for the next season |
||||||||||
- Run a series of tests against the services | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It also stands up a private instance of the batch services and tests against those! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh i see you mentioned that below. nice. |
||||||||||
- Each test: | ||||||||||
- Submits a specially crafted batch to the newly deployed batch namespace | ||||||||||
- Checks that the results of running the batch are what we would expect | ||||||||||
- The CI service polls the batch which it submits for overall success or failure. | ||||||||||
- Once the batch has either succeeded or failed, CI uses that result to report status back to GitHub | ||||||||||
|
||||||||||
Examples of CI test runs can be seen by searching through the production batch log, as long as you have developer | ||||||||||
permissions. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1` | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You need to be a member of the |
||||||||||
|
||||||||||
#### CI Testing Timeline | ||||||||||
|
||||||||||
The image below shows the CI testing timeline: | ||||||||||
|
||||||||||
![CI Testing Timeline](ci-test.png) | ||||||||||
|
||||||||||
|
||||||||||
## Merging PRs to the `main` branch | ||||||||||
|
||||||||||
When a PR's status changes, it will trigger a hook which alerts the CI service. The CI service will | ||||||||||
determine whether the PR is ready to be merged. If it is, the CI service will merge the pull request. | ||||||||||
|
||||||||||
Readiness is determined by github status checks. The following conditions must be met: | ||||||||||
|
||||||||||
- The PR must be against the `main` branch | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergability in the first place... so I'll remove this line |
||||||||||
- The PR must have passed all tests (and there must be none outstanding) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in gcp. note that the ci tests on azure are not required to merge the pr There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this true? I could have sworn I'd seen PRs with GCP CI green and Azure CI yellow still not merged. As with the comment above... this makes me think I should check whether CI is trigger-based or polling-based There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's true as far as github is concerned. I haven't looked too closely about what ci does with respect to its counterpart in foreign clouds |
||||||||||
- The PR must have sufficient approvals | ||||||||||
|
||||||||||
## Deploying services to the live infrastructure | ||||||||||
|
||||||||||
When a PR is merged into the `main` branch, the CI service will: | ||||||||||
|
||||||||||
- Check for validity | ||||||||||
- The author must be a known developer | ||||||||||
- The PR must be against the `main` branch | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this a copypasta or does ci actually do this? I thought it deployed new commits on main irrespective of the commit author? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It needs to be against a watched branch. Which in hail-is is only |
||||||||||
- Create a new batch job to: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this is also |
||||||||||
- Build various components and service images | ||||||||||
- Deploy to a test namespace | ||||||||||
- Run tests against the services | ||||||||||
- Deploy to the `default` (ie prod) namespace | ||||||||||
- A handful of final actions | ||||||||||
- eg rolling out artifact registry cleanup policies, amongst many other things | ||||||||||
|
||||||||||
Note: It's not actually quite as waterfall-y as this. In fact the jobs are all running in a hail | ||||||||||
batch, and each service being deploy has its own path through the DAG. So it's quite possible that the services are | ||||||||||
test/deploy-ing in parallel, and that the deploy for one service might happen before the test for another has even begun. | ||||||||||
|
||||||||||
This should all be fine, because it was previously tested as part of the PR approval process. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really deploy anything before all test steps have succeeded? My mental model was that we ignore the tests run on the pr pre-merging, and treat it as untested post-merge, because the merge could have introduced semantic conflicts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a good question. This is an assumption based on -
But I could easily have missed something. Worth double checking There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Side-note: oh wow, some job groups would be a really! nice addition when trying to work this kind of thing out!) Some selected examples from a recent deploy:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that you can see the dependencies directly in build.yaml. I believe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To close the loop here: |
||||||||||
|
||||||||||
Examples of CI deploy runs can be seen by searching through the production batch log, as long as you have developer | ||||||||||
permissions. For example: `/batches?q=user+%3D+ci%0D%0Adeploy+%3D+1` | ||||||||||
|
||||||||||
#### CI Deploy Timeline | ||||||||||
|
||||||||||
The image below shows the CI deployment timeline: | ||||||||||
|
||||||||||
![CI Testing Timeline](ci-deploy.png) | ||||||||||
|
||||||||||
## Issues | ||||||||||
|
||||||||||
### Hanging PR state | ||||||||||
|
||||||||||
It is possible for a CI instance to fail when trying to update PR status. In that case, | ||||||||||
the PRs would be hanging in a "pending" state. | ||||||||||
|
||||||||||
This can be fixed by pushing a new commit to the branch. | ||||||||||
|
||||||||||
## References | ||||||||||
|
||||||||||
- Diagrams source: [Link](https://lucid.app/lucidchart/e4fbcce3-5d73-4fa8-a729-eb2a4488d464/edit?viewport_loc=-97%2C228%2C2061%2C1115%2C12R1RM5CUFfr&invitationId=inv_148dfe48-17b2-49c8-bce7-fb06fc92bf48) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.