Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] Document CI System #14656

Merged
merged 16 commits into from
Sep 26, 2024
114 changes: 114 additions & 0 deletions dev-docs/services/ci/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
# The CI Service in Hail Batch

Hail Batch includes a CI service which has three jobs:

- Runs tests against pull requests
- Merges PRs to the `main` branch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Merges PRs to the `main` branch
- Merges PRs to the `main` branch when all checks have passed

- Deploys services to the live infrastructure

It does all of these things only if certain conditions are met.

## Deployment

The CI service itself is deployed as a Kubernetes service in the Hail Batch cluster. See
[architecture diagram](../Hail%20Batch%20Architectural%20Diagram.png).

As part of its configuration, CI must be configured with an access token allowing it to operate on
behalf of a github account called hail-ci-robot.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repetition of "configure"

Suggested change
As part of its configuration, CI must be configured with an access token allowing it to operate on
behalf of a github account called hail-ci-robot.
CI must be configured with an access token allowing it to operate on
behalf of the 'hail-ci-robot' github account.


To make CI work it needs to receive update triggers from the github repository. These are
configured within the github repository itself.

## Running tests against pull requests

The CI service is constantly monitoring for new PRs in the `hail` repository.

When a new PR is opened, or an existing PR changes, hooks are triggered which alert the CI service.
The CI service will then:

- Check for validity

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded

- The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI)
- The PR must be against the `main` branch

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this extra context is super helpful. IMO it's worth including both "how we have it set up" and "what can be configured"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also makes it sound like CI is more poll-based than trigger-based? I don't have enough permissions in the repo to check what hooks we have set up so some of this is speculative (I should probably have done / go ahead and do a better job of highlighting guesswork)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the advice from Daniel that "deleting CI's most recent deploy run will make it run again" seems like it must be polling based. I'll look into this today

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit of both. Updated the doc with that info

- Run tests
- The CI service will run tests against the PR

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll add this under the details of what the test batch does below.

- The CI service will report the results of the tests back to GitHub

### Running Tests

The process of running tests goes like:

- CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I've mentioned this now

- The batch contains jobs which will:
- Clone the appropriate branch and build a new set of docker images for the updated services.
- Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
- These namespaces are called

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotta leave a couple of cliffhangers in here for the next season

- Run a series of tests against the services
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also stands up a private instance of the batch services and tests against those!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i see you mentioned that below. nice.

- Each test:
- Submits a specially crafted batch to the newly deployed batch namespace
- Checks that the results of running the batch are what we would expect
- The CI service polls the batch which it submits for overall success or failure.
- Once the batch has either succeeded or failed, CI uses that result to report status back to GitHub

Examples of CI test runs can be seen by searching through the production batch log, as long as you have developer
permissions. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to be a member of the ci billing project i believe.


#### CI Testing Timeline

The image below shows the CI testing timeline:

![CI Testing Timeline](ci-test.png)


## Merging PRs to the `main` branch

When a PR's status changes, it will trigger a hook which alerts the CI service. The CI service will
determine whether the PR is ready to be merged. If it is, the CI service will merge the pull request.

Readiness is determined by github status checks. The following conditions must be met:

- The PR must be against the `main` branch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it main, or any watched branch? Just since you started out making the distinction carefully, we should maintain it throughout.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergability in the first place... so I'll remove this line

- The PR must have passed all tests (and there must be none outstanding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in gcp. note that the ci tests on azure are not required to merge the pr

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I could have sworn I'd seen PRs with GCP CI green and Azure CI yellow still not merged. As with the comment above... this makes me think I should check whether CI is trigger-based or polling-based

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true as far as github is concerned. I haven't looked too closely about what ci does with respect to its counterpart in foreign clouds

- The PR must have sufficient approvals

## Deploying services to the live infrastructure

When a PR is merged into the `main` branch, the CI service will:

- Check for validity
- The author must be a known developer
- The PR must be against the `main` branch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a copypasta or does ci actually do this? I thought it deployed new commits on main irrespective of the commit author?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be against a watched branch. Which in hail-is is only main. Doesn't matter who the PR originator was.

- Create a new batch job to:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is also build.yaml, just with a configuration variable set to make it deploy?

- Build various components and service images
- Deploy to a test namespace
- Run tests against the services
- Deploy to the `default` (ie prod) namespace
- A handful of final actions
- eg rolling out artifact registry cleanup policies, amongst many other things

Note: It's not actually quite as waterfall-y as this. In fact the jobs are all running in a hail
batch, and each service being deploy has its own path through the DAG. So it's quite possible that the services are
test/deploy-ing in parallel, and that the deploy for one service might happen before the test for another has even begun.

This should all be fine, because it was previously tested as part of the PR approval process.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really deploy anything before all test steps have succeeded? My mental model was that we ignore the tests run on the pr pre-merging, and treat it as untested post-merge, because the merge could have introduced semantic conflicts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. This is an assumption based on -

  1. The batch lists the tasks in service-by-service order
  2. The "deploy batch" job is listed before "test batch" jobs
  3. There only seems to be one "deploy batch" job and it seems to target the default namespace

But I could easily have missed something. Worth double checking

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Side-note: oh wow, some job groups would be a really! nice addition when trying to work this kind of thing out!)

Some selected examples from a recent deploy:

  • Job 27: batch_image ran from 21:58 to 22:00
  • Job 39: test_hail_java_0 ran from 22:12 to 22:17
  • Job 44: test_hail_python_0 ran from 22:12 to 22:18
  • Job 78: test_hail_python_local_backend_0 ran from 22:15 to 22:21
  • Job 153 deploy_batch ran from 22:04 to 22:05
  • Job 176: test_batch_0 ran from 22:15 to 22:23
  • Job 182: deploy_ci ran from 22:05 to 22:05

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you can see the dependencies directly in build.yaml. I believe deploy_batch is what deploys the new batch service in the test namespace, as test_batch depends on it. I'm not sure which job actually deploys to production.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop here: deploy_batch does deploy to production (in the default namespace), which happens before tests are run. But, CI will only merge a PR if it has a successful CI run on top of the current head of main. That means the commit that will be merged and deployed is exactly the same (with the same git sha) as the commit that passed CI. There are still a few small holes for issues to slip past the tests, in particular because CI runs the batch tests in the test namespace in a fresh deployment, while the tests after deploying are run in the default namespace against the production system.


Examples of CI deploy runs can be seen by searching through the production batch log, as long as you have developer
permissions. For example: `/batches?q=user+%3D+ci%0D%0Adeploy+%3D+1`

#### CI Deploy Timeline

The image below shows the CI deployment timeline:

![CI Testing Timeline](ci-deploy.png)

## Issues

### Hanging PR state

It is possible for a CI instance to fail when trying to update PR status. In that case,
the PRs would be hanging in a "pending" state.

This can be fixed by pushing a new commit to the branch.

## References

- Diagrams source: [Link](https://lucid.app/lucidchart/e4fbcce3-5d73-4fa8-a729-eb2a4488d464/edit?viewport_loc=-97%2C228%2C2061%2C1115%2C12R1RM5CUFfr&invitationId=inv_148dfe48-17b2-49c8-bce7-fb06fc92bf48)
Binary file added dev-docs/services/ci/ci-deploy.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added dev-docs/services/ci/ci-test.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading