Add representative rechunking workflow #1536

Merged
merged 15 commits into from
Sep 5, 2024

Conversation

@jrbourbeau (Member) commented Aug 26, 2024

Resolved (outdated) review threads on:
- .github/workflows/tests.yml
- ci/environment-git-tip.yml
- tests/geospatial/test_rechunking.py (two threads)
@hendrikmakait hendrikmakait self-requested a review August 30, 2024 14:37
@jrbourbeau jrbourbeau changed the title [WIP] Add representative rechunking workflow Add representative rechunking workflow Aug 30, 2024
@@ -0,0 +1,77 @@
name: Geospatial Workflows
Member:

What's the intention behind having a separate CI workflow for this? For context, we have a separate workflow for TPC-H because we needed to be able to run these independently when preparing our TPC-H benchmarking data/post.

Member Author:

Thanks for the extra context, that's useful. I was following the TPC-H precedent of using a separate workflow to handle optionally running large versions of tests. It also seemed easier to get something started. Agreed that maintaining a separate workflow isn't ideal -- are there other downsides to not using the main tests.yml for this? (Still getting up to speed on how CI works here.)

Member Author:

I removed the extra workflow file. We're now running the small version of this test on our normal test workflow (here is an example run). The --scale switch still exists so we can trigger the larger version when needed.
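The `--scale` switch mentioned above is the kind of thing pytest supports via a `conftest.py` hook. A minimal sketch of how such a switch can be wired up -- hypothetical option and fixture names, not this repository's actual conftest:

```python
# Hypothetical sketch of a pytest --scale switch; the option and fixture
# names here are assumptions, not this repository's actual conftest.py.
import pytest


def pytest_addoption(parser):
    # `pytest --scale large` opts in to the expensive version of a workflow;
    # plain `pytest` keeps the small, CI-friendly default.
    parser.addoption(
        "--scale",
        action="store",
        default="small",
        choices=("small", "large"),
        help="Run the small (default) or large variant of workflow tests",
    )


@pytest.fixture
def scale(request):
    # Tests branch on this fixture to pick, e.g., a three-year time slice
    # instead of the full dataset.
    return request.config.getoption("--scale")
```

A test body can then do `if scale == "small": ...` to select the ~100 GiB subset, so the large run only happens when explicitly triggered.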

# 101.83 GiB (small)
time_range = slice("2020-01-01", "2023-01-01")
else:
# 2.12 TiB (large)
Member:

How expensive is it to run this?

Member Author:

[reply not captured in this export]

Member:

Maybe a distracting tangent, but the disk activity there is oddly cyclical.

[Screenshot 2024-09-03 9:57:40 AM: cyclical disk activity]

Member:

(no need to engage on that comment in this PR, just an observation that interested me)

@hendrikmakait (Member) left a comment:

Thanks, @jrbourbeau. The code looks good to me; there's one non-blocking question about the auth step.

Comment on lines +124 to +128
- name: Google auth
uses: "google-github-actions/auth@v2"
with:
credentials_json: "${{ secrets.GCP_CREDENTIALS }}"

Member:

Is this step necessary? If I had to guess, it should be enough to pass the credentials to the GCSFileSystem in the fixture.

Member:

This enables Coiled to pick up the credentials and forward them to the cluster for use when writing to the restricted GCS bucket.

Member:

Thanks for the explanation!

result = subset.chunk({"time": -1, "longitude": "auto", "latitude": "auto"})

# Write result to cloud storage
result.to_zarr(gcs_url, storage_options={"token": CoiledShippedCredentials()})
Member:

This is where the code explicitly uses the forwarded temporary credentials.
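For readers skimming the diff above: in the chunk spec `{"time": -1, "longitude": "auto", "latitude": "auto"}`, `-1` means a single chunk spanning the whole dimension (so every output chunk holds a full time series), while `"auto"` lets dask size the remaining dimensions toward its configured chunk byte target. A rough, library-free sketch of that normalization -- a simplification for illustration, not dask's actual implementation, and the dimension sizes below are only illustrative:

```python
# Simplified stand-in for how dask normalizes a per-dimension chunk spec.
# Illustration only; dask's real logic also accounts for dtype and the
# `array.chunk-size` config value.
def resolve_chunk(dim_size: int, spec, auto_target: int) -> int:
    if spec == -1:
        # -1: one chunk covering the entire dimension
        return dim_size
    if spec == "auto":
        # "auto": library-chosen size, capped toward a size target
        return min(dim_size, auto_target)
    # explicit integer sizes pass through (clamped to the dimension)
    return min(dim_size, spec)


# time=-1 keeps the full time axis contiguous in every chunk, which is the
# point of rechunking ERA5-style data for time-series access patterns.
sizes = {"time": 26_280, "longitude": 1_440, "latitude": 721}  # illustrative
spec = {"time": -1, "longitude": "auto", "latitude": "auto"}
chunks = {d: resolve_chunk(sizes[d], spec[d], auto_target=240) for d in sizes}
# chunks["time"] == 26_280; longitude and latitude are capped at 240
```

The trade-off is that `time: -1` produces chunks whose size is governed entirely by the spatial `"auto"` dimensions, which is why the spatial chunks end up small.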

@jrbourbeau jrbourbeau merged commit c626d45 into main Sep 5, 2024
15 checks passed
@jrbourbeau jrbourbeau deleted the era5-rechunking branch September 5, 2024 14:27

4 participants