Add representative rechunking workflow #1536
Conversation
.github/workflows/geospatial.yml
Outdated
@@ -0,0 +1,77 @@
name: Geospatial Workflows
What's the intention behind having a separate CI workflow for this? For context, we have a separate workflow for TPC-H because we needed to be able to run these independently when preparing our TPC-H benchmarking data/post.
Thanks for the extra context, that's useful. I was following the TPC-H precedent of using a separate workflow to optionally run large versions of tests. It also seemed easier to get something started. Agreed that maintaining a separate workflow isn't ideal. Are there other downsides to not using the main test.yml for this? (Still getting up to speed on how CI works here.)
I removed the extra workflow file. We're now running the small version of this test in our normal test workflow (here is an example run). The --scale switch still exists, so we can trigger the larger version when needed.
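For reference, a `--scale` switch like the one described can be registered through pytest's `conftest.py` hook. This is a minimal sketch, not the repo's actual wiring; the option name comes from the discussion, everything else is assumed:

```python
# conftest.py-style sketch; defaults and choices are assumptions, not from the PR
def pytest_addoption(parser):
    # `pytest --scale large` opts in to the expensive multi-TiB version;
    # CI runs the small version by default
    parser.addoption(
        "--scale",
        action="store",
        default="small",
        choices=("small", "large"),
        help="dataset scale for the rechunking workflow test",
    )


def scale_from_config(config):
    # helper a `scale` fixture could call to read the selected scale
    return config.getoption("--scale")
```

A `scale` fixture would then read this option, and the test would pick its dataset size accordingly.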
# 101.83 GiB (small)
time_range = slice("2020-01-01", "2023-01-01")
else:
# 2.12 TiB (large)
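The branch above can be read as a small pure function. A sketch, with the large-scale time range left open since the diff doesn't show it:

```python
def select_time_range(scale: str) -> slice:
    """Pick the time subset for the rechunking test.

    The small range matches the diff above (~101.83 GiB). The large case
    (~2.12 TiB) presumably uses the full dataset; that is an assumption here,
    since the actual bounds aren't visible in this diff.
    """
    if scale == "small":
        return slice("2020-01-01", "2023-01-01")
    # large: take everything (assumed)
    return slice(None)
```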
How expensive is it to run this?
Here's an example run of the large version (it cost $10): https://cloud.coiled.io/clusters/573762/account/dask-engineering-gcp/information?workspace=dask-engineering-gcp
(no need to engage on that comment in this PR, just an observation that interested me)
Thanks, @jrbourbeau. The code looks good to me, there's one non-blocking question about the auth step.
- name: Google auth
  uses: "google-github-actions/auth@v2"
  with:
    credentials_json: "${{ secrets.GCP_CREDENTIALS }}"
Is this step necessary? If I had to guess, it should be enough to pass the credentials to the GCSFileSystem in the fixture.
This enables Coiled to pick up the credentials and forward them to the cluster for use when writing to the restricted GCS bucket.
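Concretely, the ordering matters: the auth step has to run before the cluster is created so Coiled can find the credentials to forward. A rough sketch of the relevant job steps (the auth step and secret name come from the diff above; the surrounding comments describe the flow and are an interpretation, not repo code):

```yaml
- name: Google auth
  uses: "google-github-actions/auth@v2"
  with:
    credentials_json: "${{ secrets.GCP_CREDENTIALS }}"

# A later step runs the tests, which create the Coiled cluster.
# Coiled picks up the credentials written by the auth step and forwards
# short-lived tokens to the workers, which CoiledShippedCredentials then
# uses when writing to the restricted GCS bucket.
```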
Thanks for the explanation!
result = subset.chunk({"time": -1, "longitude": "auto", "latitude": "auto"})

# Write result to cloud storage
result.to_zarr(gcs_url, storage_options={"token": CoiledShippedCredentials()})
This is where the code explicitly uses the forwarded temporary credentials, based on the example here: https://gist.github.com/jrbourbeau/183a47cad1f32c6bf9bc94b546fab960
EDIT: Here's an example of this running successfully in CI: https://github.com/coiled/benchmarks/actions/runs/10584886955/job/29330190062