
[Docs] Document CI System #14656

Merged
merged 16 commits into hail-is:main from cjl_document_ci on Sep 26, 2024

Conversation

cjllanwarne
Collaborator

@cjllanwarne cjllanwarne commented Aug 8, 2024

Fixes #14657

An initial set of documentation for how our CI system works.

Documentation only, no production change.

Here's a link to a rendered version of the markdown.

Note to reviewers - please (please!) check the details to make sure I got this right!

Member

@ehigham ehigham left a comment

Thank you for taking the time to do this. These documents are so so helpful. I'm very impressed by how much you've understood in such a short time. I have a few comments, some that add context and one or two puzzles.

Your diagrams are brilliant. Would it be possible to use svg instead of png please? It gets very grainy when i zoom in!

When a new PR is opened, or an existing PR changes, hooks are triggered which alert the CI service.
The CI service will then:

- Check for validity

This comment was marked as resolved.

Collaborator Author

Reworded


- Check for validity
- The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI)
- The PR must be against the `main` branch

This comment was marked as resolved.

Collaborator Author

I think this extra context is super helpful. IMO it's worth including both "how we have it set up" and "what can be configured"

Collaborator Author

This also makes it sound like CI is more poll-based than trigger-based? I don't have enough permissions in the repo to check what hooks we have set up, so some of this is speculative (I should probably have done, and will go ahead and do, a better job of highlighting guesswork)

Collaborator Author

In fact, the advice from Daniel that "deleting CI's most recent deploy run will make it run again" seems like it must be polling based. I'll look into this today

Collaborator Author

It's a bit of both. Updated the doc with that info
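
To make the validity check quoted at the top of this thread concrete, here is a hedged sketch of the logic. All of the names (`pr`, `known_developers`, `approved_shas`) are illustrative and are not the actual identifiers in `ci/ci/github.py`:

```python
# Hedged sketch of the validity check described above; every name here is
# illustrative and does not necessarily match ci/ci/github.py.
def pr_is_valid(pr, known_developers: set, approved_shas: set) -> bool:
    # The author must be a known developer, or the specific commit SHA must
    # have been approved individually through the CI service UI.
    author_ok = pr.author in known_developers or pr.head_sha in approved_shas
    # The PR must target a watched branch (in hail-is, only `main`).
    branch_ok = pr.target_branch == 'main'
    return author_ok and branch_ok
```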

- The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI)
- The PR must be against the `main` branch
- Run tests
- The CI service will run tests against the PR

This comment was marked as resolved.

Collaborator Author

👍 I'll add this under the details of what the test batch does below.


The process of running tests goes like:

- CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials

This comment was marked as resolved.

Collaborator Author

👍 I've mentioned this now

- The batch contains jobs which will:
- Clone the appropriate branch and build a new set of docker images for the updated services.
- Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
- These namespaces are called

This comment was marked as resolved.

Collaborator Author

Gotta leave a couple of cliffhangers in here for the next season

- Clone the appropriate branch and build a new set of docker images for the updated services.
- Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
- These namespaces are called
- Run a series of tests against the services
Member

It also stands up a private instance of the batch services and tests against those!

Member

Oh i see you mentioned that below. nice.
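
Since this thread is about what the test batch actually does, here is an illustrative sketch of the "generate a batch and submit it with CI's own credentials" step using the `hailtop.batch` client. CI's real implementation derives its jobs from `build.yaml` rather than hard-coding them, and the billing project, job names, and commands below are placeholders:

```python
# Illustrative sketch only: CI builds its jobs from build.yaml rather than
# hard-coding them like this; the billing project, job names and commands
# are placeholders.
import hailtop.batch as hb

backend = hb.ServiceBackend(billing_project='ci')   # CI's own service credentials
b = hb.Batch(backend=backend, name='test PR (example)')

build = b.new_job(name='build_images')
build.command('echo "build docker images for the updated services"')

test = b.new_job(name='test_batch_0')
test.depends_on(build)    # jobs form a DAG rather than a strict waterfall
test.command('echo "run tests against the freshly deployed test namespace"')

b.run()
```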

Comment on lines 52 to 53
Examples of CI test runs can be seen by searching through the production batch log, as long as you have developer
permissions. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1`
Member

You need to be a member of the ci billing project i believe.
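
As an aside, the query string in that example URL is just URL-encoded batch search syntax; decoding it shows exactly what it filters on:

```python
# Decode the example batch-log query string quoted above.
from urllib.parse import unquote_plus

print(unquote_plus("user+%3D+ci%0D%0Atest+%3D+1"))
# user = ci
# test = 1
```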

Readiness is determined by github status checks. The following conditions must be met:

- The PR must be against the `main` branch
- The PR must have passed all tests (and there must be none outstanding)
Member

in gcp. note that the ci tests on azure are not required to merge the pr

Collaborator Author

Is this true? I could have sworn I'd seen PRs with GCP CI green and Azure CI yellow still not merged. As with the comment above... this makes me think I should check whether CI is trigger-based or polling-based

Member

It's true as far as github is concerned. I haven't looked too closely at what ci does with respect to its counterpart in foreign clouds
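
For anyone who wants to check this themselves, "required checks" ultimately come from GitHub status checks on the PR's head commit, which can be inspected via the GitHub REST API. A hedged sketch (not CI's code; the token and SHA are placeholders):

```python
# Sketch: inspect the combined status of a PR's head commit via the GitHub
# REST API. <token> and <pr-head-sha> are placeholders.
import requests

resp = requests.get(
    'https://api.github.com/repos/hail-is/hail/commits/<pr-head-sha>/status',
    headers={'Authorization': 'token <token>'},
)
combined = resp.json()
print(combined['state'])                        # 'success', 'pending' or 'failure'
for status in combined['statuses']:
    print(status['context'], status['state'])   # e.g. the GCP and Azure CI checks
```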

Comment on lines 77 to 79
- Check for validity
- The author must be a known developer
- The PR must be against the `main` branch
Member

Is this a copypasta or does ci actually do this? I thought it deployed new commits on main irrespective of the commit author?

Collaborator Author

It needs to be against a watched branch. Which in hail-is is only main. Doesn't matter who the PR originator was.

Comment on lines 88 to 92
Note: It's not actually quite as waterfall-y as this. In fact the jobs are all running in a hail
batch, and each service being deployed has its own path through the DAG. So it's quite possible that the services are
test/deploy-ing in parallel, and that the deploy for one service might happen before the test for another has even begun.

This should all be fine, because it was previously tested as part of the PR approval process.
Collaborator

Do we really deploy anything before all test steps have succeeded? My mental model was that we ignore the tests run on the pr pre-merging, and treat it as untested post-merge, because the merge could have introduced semantic conflicts.

Collaborator Author

This is a good question. This is an assumption based on -

  1. The batch lists the tasks in service-by-service order
  2. The "deploy batch" job is listed before "test batch" jobs
  3. There only seems to be one "deploy batch" job and it seems to target the default namespace

But I could easily have missed something. Worth double checking

Collaborator Author

(Side-note: oh wow, some job groups would be a really nice addition when trying to work this kind of thing out!)

Some selected examples from a recent deploy:

  • Job 27: batch_image ran from 21:58 to 22:00
  • Job 39: test_hail_java_0 ran from 22:12 to 22:17
  • Job 44: test_hail_python_0 ran from 22:12 to 22:18
  • Job 78: test_hail_python_local_backend_0 ran from 22:15 to 22:21
  • Job 153: deploy_batch ran from 22:04 to 22:05
  • Job 176: test_batch_0 ran from 22:15 to 22:23
  • Job 182: deploy_ci ran from 22:05 to 22:05

Collaborator

Note that you can see the dependencies directly in build.yaml. I believe deploy_batch is what deploys the new batch service in the test namespace, as test_batch depends on it. I'm not sure which job actually deploys to production.
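
If it helps, the DAG can be eyeballed straight from `build.yaml` with a few lines of Python. This is a hedged sketch: it assumes each step declares its prerequisites under a `dependsOn` key, which is worth verifying against the actual file:

```python
# Hedged sketch: print each build.yaml step and its declared dependencies.
# Assumes steps list prerequisites under a `dependsOn` key; verify against
# the real file before relying on this.
import yaml

with open('build.yaml') as f:
    config = yaml.safe_load(f)

for step in config.get('steps', []):
    deps = step.get('dependsOn', [])
    print(f"{step['name']} <- {', '.join(deps) if deps else '(no dependencies)'}")
```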

Collaborator

To close the loop here: deploy_batch does deploy to production (in the default namespace), which happens before tests are run. But, CI will only merge a PR if it has a successful CI run on top of the current head of main. That means the commit that will be merged and deployed is exactly the same (with the same git sha) as the commit that passed CI. There are still a few small holes for issues to slip past the tests, in particular because CI runs the batch tests in the test namespace in a fresh deployment, while the tests after deploying are run in the default namespace against the production system.

Member

@ehigham ehigham left a comment

diagrams are great. thanks for your effort. a couple of small things.

Comment on lines 1 to 3
# The CI Service in Hail Batch

Hail Batch includes a CI service which has three functional purposes:
Member

This is borderline pedantry, but I don't think of "Hail Batch" as being anything other than the batch service and the python client library. the batch service + ci etc are generally referred to as the "Services".

Suggested change
- # The CI Service in Hail Batch
- Hail Batch includes a CI service which has three functional purposes:
+ # The CI Service
+ The CI service has three functional purposes:

Hail Batch includes a CI service which has three functional purposes:

- Runs tests against pull requests
- Merges PRs to the `main` branch
Member

Suggested change
- - Merges PRs to the `main` branch
+ - Merges PRs to the `main` branch when all checks have passed

Comment on lines 14 to 15
As part of its configuration, CI must be configured with an access token allowing it to operate on
behalf of a github account called hail-ci-robot.
Member

repetition of "configure"

Suggested change
- As part of its configuration, CI must be configured with an access token allowing it to operate on
- behalf of a github account called hail-ci-robot.
+ CI must be configured with an access token allowing it to operate on
+ behalf of the 'hail-ci-robot' github account.

- The state of the deployed system ("`batch_changed`")
- The state of the watched branch itself (for example - its current commit) ("`state_changed`")

In hail-is, there is exactly one watched branch, which is `hail-is/hail:main`.
Member

I think it would make more sense to describe the deployment model before mentioning the organisation. Reading this, i was like "what does hail-is have to do with this". Alternatively, say something along the lines of "the default-namespace deployment of ci currently tracks hail-is/hail:main".

Member

FWIW, I don't particularly like documenting the current values of config variables as these go stale quickly. I'd link to the file where interested readers can find current values.

Collaborator Author

@cjllanwarne cjllanwarne Aug 15, 2024

Hmm, I can move it and reword it. I want to strike a balance between "documenting something very general purpose" and "explaining how it is being used by us". I'll try again

Collaborator

I agree. This document will be useful for understanding how our CI deployment works on hail-is/hail:main. I think whether some behavior is hard-coded or controlled by configuration is of secondary importance. If some config variable changes frequently, then agreed we should avoid talking about its current setting, but I don't think that's the case here.


### Github Configuration

To make CI more responsive it can be configured to receive webhook event triggers from the github repository.
Member

Suggested change
- To make CI more responsive it can be configured to receive webhook event triggers from the github repository.
+ To make CI more responsive it is configured to receive webhook event triggers from the github repository.

Collaborator Author

Again, this is an accident of trying to be both a general purpose doc ("can be") vs how we use it ("is")
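
For readers unfamiliar with webhook delivery, here is a minimal sketch of what "receive webhook event triggers" looks like on the service side, using aiohttp (which the Hail services are built on, as far as I know). The route path and the dirty-flag bookkeeping are assumptions for illustration, not the actual CI implementation:

```python
# Minimal, illustrative webhook receiver; the route path and flag handling
# are assumptions, not the actual ci service code.
from aiohttp import web

routes = web.RouteTableDef()
github_changed = set()   # repos/branches whose GitHub state is now "dirty"

@routes.post('/github_callback')
async def github_callback(request: web.Request) -> web.Response:
    event = request.headers.get('X-GitHub-Event')   # e.g. 'pull_request', 'push'
    payload = await request.json()
    if event in ('pull_request', 'pull_request_review', 'push'):
        github_changed.add(payload['repository']['full_name'])
    return web.Response(status=200)

app = web.Application()
app.add_routes(routes)
# web.run_app(app, port=5000)
```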

- Tasks in the batch are defined in `build.yaml` in the top repo directory. CI generates jobs in the batch from the steps defined in there.
- The batch contains jobs which will:
- Clone the appropriate branch
- Squash and rebase against `main`
Member

more pedantry

Suggested change
- - Squash and rebase against `main`
+ - Squash and rebase onto `main`

- The batch contains jobs which will:
- Clone the appropriate branch
- Squash and rebase against `main`
- Build a new set of docker images for the updated services.
Member

It also builds and tests images for hail query

Readiness is determined by github status checks. The following conditions must be met:

- The PR must be against the `main` branch
- The PR must have passed all tests in GCP
Member

Suggested change
- - The PR must have passed all tests in GCP
+ - The PR must have passed all required checks

Comment on lines 242 to 244
2. Do we really deploy first / test second? And do we really deploy Batch and CI using jobs that are already running in
Batch and CI? Do the services get shut down and reconnect to existing instances of the deploy jobs started by the
previous version?
Member

Don't quite understand the question.
There's one deploy job at a time that deploys already tested artefacts (albeit, tested in a different namespace).
k8s handles updating the various pods with their new images.
Sometimes, the services are brought down depending on database migrations. I think that's defined in build.yaml.

Collaborator

@patrick-schultz patrick-schultz left a comment

This is looking great, Chris! I just have a few questions/clarifications.


When a PR state changes, it will cause the `github_changed` flag to become dirty.

During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to
Collaborator

Suggested change
- During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to
+ During its update loop, the CI service will determine whether any PRs targeting its watched branches are eligible to


## Merging PRs to the `main` branch

When a PR state changes, it will cause the `github_changed` flag to become dirty.
Collaborator

Is this the github_changed flag of the target branch (if that is a watched branch)?


Readiness is determined by github status checks. The following conditions must be met:

- The PR must be against the `main` branch
Collaborator

Is it main, or any watched branch? Just since you started out making the distinction carefully, we should maintain it throughout.

Collaborator Author

Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergeability in the first place... so I'll remove this line

The control flow from final approval to CI merging a PR looks like:

- The PR's state will change in github (either a check changes to SUCCESS, or a review is approved)
- The github webhook callback will cause the `github_changed` flag to be marked as dirty for the `WatchedBranch`
Collaborator

The watched branch which is the target of the PR?

- The PR's state will change in github (either a check changes to SUCCESS, or a review is approved)
- The github webhook callback will cause the `github_changed` flag to be marked as dirty for the `WatchedBranch`
- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) scans all PRs against the branch and updates state that helps determine mergeability.
- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) iterates again to merge all mergeable PRs, in priority order
Collaborator

There's a subtlety here which is being glossed over. As I understand (without having inspected the code, so correct me if I'm wrong), we keep a PR's check set to SUCCESS even if the successful CI run was on top of an outdated main (i.e. another PR has merged since this PR was last tested). Rerunning the tests of all PRs every time any PR merges would be wasteful, but we don't want to merge a PR unless it has a successful CI run on top of the current main. My mental model is that CI chooses one PR to be the next merge candidate, and reruns the tests if the last run is out of date, then merges only if that succeeds. But I don't know for sure if that's right.

Collaborator Author

Hmm, interesting. I didn't see evidence for this in the code but it seems logical.

I also wonder whether this is why we sometimes see PRs get "stuck" with pending checks. Perhaps CI is just choosing to delay re-running until the PR is back on top of its mergeable list 🤔

Collaborator Author

I did a deep dive into this today, and you are indeed correct. I've added a flowchart showing the full set of paths through the update loop, which might be overkill in the end, but was a useful exercise for me in the process of making it.

What we seem to do is -

  • During heal:
    • Nominate a current merge candidate
      • If its tests are out of date, re-run them
  • During try_to_merge:
    • Iterate over all PRs
      • Test is_mergeable
      • Merge if mergeable (and return after the first success)

So... I think it's possible that we might kick off a "merge candidate" test and then decide to merge some other PR while those tests are running (if its tests were already up to date)
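
To summarize the loop described above as a sketch: the `heal` / `try_to_merge` split is taken from this discussion, while the other helper methods are illustrative placeholders rather than the real names in `ci/ci/github.py`:

```python
# Sketch of the update loop described above. `heal`, `try_to_merge` and
# `is_mergeable` come from the discussion; the remaining helpers are
# illustrative placeholders.
async def update(self):
    await self.heal()           # nominate a merge candidate, re-run stale tests
    await self.try_to_merge()   # merge the first PR that is already mergeable

async def heal(self):
    candidate = self.choose_merge_candidate()            # illustrative helper
    if candidate is not None and candidate.tests_out_of_date():
        await candidate.rerun_tests()

async def try_to_merge(self):
    for pr in self.prs_in_priority_order():              # illustrative helper
        if await pr.is_mergeable():
            await pr.merge()    # squash-and-rebase onto the watched branch
            return              # stop after the first successful merge
```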

During its update loop, the CI service will determine that the SHA of the `WatchedBranch` has changed and trigger
a deployment.

- Create a new batch job to:
Collaborator

I believe this is also build.yaml, just with a configuration variable set to make it deploy?

that services are deploy/test-ing in parallel, and that the deploy for one service might happen before the test for
another has completed.

This should all be fine, because everything was previously tested as part of the PR approval process.
Collaborator

Should emphasize here that a PR is only merged (really, squashed and rebased) if the exact commit that is added to main, with the exact same SHA, has passed CI.

Comment on lines 240 to 241
1. Sometimes in logs we see messages like `"update github br-hail-ci-test-ci-test-<RANDOM>-main"` for various random branch names.
What does this mean?
Collaborator

Could these branches be the results of squashing and rebasing a PR before running CI?

@cjllanwarne
Collaborator Author

@ehigham @patrick-schultz The PR history is getting a little long and messy now, but I think I've now answered/addressed all your comments 🤞

Collaborator

I like to resolve threads to keep track of what I've addressed. Graphite especially makes it easy to see what threads are still unresolved.

Member

@ehigham ehigham left a comment

this is brilliant work, thank you very much indeed

@cjllanwarne cjllanwarne dismissed patrick-schultz’s stale review September 4, 2024 14:51

Changes made, ready for re-review

Collaborator

@patrick-schultz patrick-schultz left a comment

Thanks again for all your work on this, I think it's super valuable!

@hail-ci-robot hail-ci-robot merged commit 258533f into hail-is:main Sep 26, 2024
3 of 4 checks passed
@cjllanwarne cjllanwarne deleted the cjl_document_ci branch September 26, 2024 15:14