
Add pudl usage metrics gcp infrastructure #3841

Merged
bendnorman merged 4 commits into main from add-pudl-usage-metrics-gcp-infra on Sep 17, 2024

Conversation

@bendnorman (Member) commented Sep 11, 2024

Overview

  • Mount the Cloud SQL pudl-usage-metrics-db to the Superset Cloud Run instance
  • Create a new bucket for raw usage metrics archives
  • Give the GitHub Action service account permission to write to the bucket

Testing

The usage metrics data can be viewed in Superset, and the GitHub metrics are being saved to the new bucket.

To-do list

- Mount the pudl-usage-metrics db to the Superset Cloud Run service
- Create a new bucket for raw usage metrics archives
- Give the GitHub Action service account permission to write to the bucket (rough Terraform sketch of all three below)
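For reference, a rough Terraform sketch of what these three items could look like. The resource names, locations, container image, and the Cloud SQL instance reference below are assumptions for illustration, not the exact values in this PR.

# Hypothetical sketch -- names, locations, and references are assumptions, not the PR's exact config.

# Bucket for raw usage metrics archives.
resource "google_storage_bucket" "pudl_usage_metrics_archives" {
  name     = "pudl-usage-metrics-archives.catalyst.coop"
  location = "US" # assumed location
}

# Let the GitHub Action service account (defined later in this PR) write objects into the bucket.
resource "google_storage_bucket_iam_member" "usage_metrics_archiver_writer" {
  bucket = google_storage_bucket.pudl_usage_metrics_archives.name
  role   = "roles/storage.objectCreator"
  member = "serviceAccount:${google_service_account.usage_metrics_archiver.email}"
}

# Attach the Cloud SQL pudl-usage-metrics-db to the Superset Cloud Run service via
# the run.googleapis.com/cloudsql-instances annotation.
resource "google_cloud_run_service" "superset" {
  name     = "superset"     # assumed service name
  location = "us-central1"  # assumed region

  template {
    metadata {
      annotations = {
        "run.googleapis.com/cloudsql-instances" = google_sql_database_instance.pudl_usage_metrics_db.connection_name # assumed instance reference
      }
    }
    spec {
      containers {
        image = "gcr.io/example-project/superset:latest" # placeholder image
      }
    }
  }
}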
Comment on lines +400 to +405
resource "google_secret_manager_secret" "pudl_usage_metrics_db_connection_string" {
secret_id = "pudl-usage-metrics-db-connection-string"
replication {
auto {}
}
}
@bendnorman (Member, Author) commented:

I figured I'd save the connection string in case we need to reconnect the db to superset.
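For context on how the secret actually gets a value: the resource above only creates the secret container, so a version holding the connection string has to be added separately. A minimal sketch, assuming the string is passed in as a sensitive Terraform variable rather than hard-coded:

# Hypothetical: store the actual connection string as a secret version.
variable "pudl_usage_metrics_db_connection_string" {
  type      = string
  sensitive = true
}

resource "google_secret_manager_secret_version" "pudl_usage_metrics_db_connection_string" {
  secret      = google_secret_manager_secret.pudl_usage_metrics_db_connection_string.id
  secret_data = var.pudl_usage_metrics_db_connection_string # e.g. "postgresql://user:password@host:5432/metrics"
}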

Comment on lines +415 to +418
resource "google_service_account" "usage_metrics_archiver" {
account_id = "usage-metrics-archiver"
display_name = "PUDL usage metrics archiver github action service account"
}
@bendnorman (Member, Author) commented:

I did create a service account key for the GitHub action in the business repo. @jdangerx would love a WIF tutorial soon!

@jdangerx (Member) commented:

I've forgotten everything I know about WIF but could re-learn it!
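Not part of this PR, but since WIF came up: a rough sketch of what keyless GitHub Actions auth via Workload Identity Federation could look like, so the archiver wouldn't need an exported service account key. Pool and provider names are assumptions; the repo in the condition is assumed to be catalyst-cooperative/pudl-usage-metrics.

# Hypothetical Workload Identity Federation setup for GitHub Actions (not in this PR).
resource "google_iam_workload_identity_pool" "github_actions" {
  workload_identity_pool_id = "github-actions-pool"
}

resource "google_iam_workload_identity_pool_provider" "github_actions" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github_actions.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-actions-provider"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "attribute.repository == \"catalyst-cooperative/pudl-usage-metrics\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let workflows from that repo impersonate the archiver service account instead of using a key.
resource "google_service_account_iam_member" "usage_metrics_archiver_wif" {
  service_account_id = google_service_account.usage_metrics_archiver.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github_actions.name}/attribute.repository/catalyst-cooperative/pudl-usage-metrics"
}

The workflow side would then authenticate with google-github-actions/auth using the provider's resource name and the service account email instead of a JSON key.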

@bendnorman added the cloud and superset labels on Sep 11, 2024
resource "google_storage_bucket_iam_member" "usage_metrics_etl_gcs_iam" {
  for_each = toset(["roles/storage.legacyBucketReader", "roles/storage.objectViewer"])
@bendnorman (Member, Author) commented:

I couldn't find a non-legacy role that gives a principal the storage.buckets.get and storage.objects.get permissions. It seems like the GCS Python client wants both to access objects in a bucket.

@jdangerx (Member) commented:

This is probably fine. If we want to switch to non-legacy roles, it looks like we could give roles/storage.objectUser for objects.get and the confusingly named roles/storage.insightsCollectorService gets you buckets.get.
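If we ever do drop the legacy role, a sketch of what that swap might look like using the roles suggested above; the bucket and member references are assumptions, since they aren't shown in this snippet.

# Hypothetical non-legacy alternative to the legacy roles above.
resource "google_storage_bucket_iam_member" "usage_metrics_etl_gcs_iam" {
  for_each = toset(["roles/storage.objectUser", "roles/storage.insightsCollectorService"])

  bucket = google_storage_bucket.pudl_usage_metrics_archives.name                 # assumed bucket reference
  role   = each.key
  member = "serviceAccount:${google_service_account.usage_metrics_etl.email}"     # assumed ETL service account reference
}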

@jdangerx (Member) left a comment:

This all looks fine, just wondering if we should manage the s3-logs bucket via terraform as well. Other than this you're good to go!


resource "google_storage_bucket_iam_member" "usage_metrics_etl_s3_logs_gcs_iam" {
for_each = toset(["roles/storage.legacyBucketReader", "roles/storage.objectViewer"])

bucket = "pudl-s3-logs.catalyst.coop"
@jdangerx (Member) commented:

Should we manage this bucket via TF too?

@bendnorman (Member, Author) commented Sep 13, 2024:

Probably! How can we manage a resource in terraform that has already been created in the UI? Also, we should probably move the contents of pudl-s3-logs.catalyst.coop to pudl-usage-metrics-archives.catalyst.coop/s3/ for consistency.

(Member) commented:

Here's the documentation about handling "resource drift"!
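For the mechanics: since Terraform 1.5 an import block can adopt an existing, UI-created resource into state without recreating it (the older route is the terraform import CLI command). A minimal sketch, assuming the bucket keeps its current name and the resource block is written to match what already exists:

# Hypothetical: adopt the UI-created bucket into Terraform state (Terraform >= 1.5).
import {
  to = google_storage_bucket.pudl_s3_logs
  id = "pudl-s3-logs.catalyst.coop"
}

resource "google_storage_bucket" "pudl_s3_logs" {
  name     = "pudl-s3-logs.catalyst.coop"
  location = "US" # assumption; must match the existing bucket or Terraform will plan changes
}

Running terraform plan after adding this should show the bucket being imported rather than created; the CLI equivalent is terraform import google_storage_bucket.pudl_s3_logs pudl-s3-logs.catalyst.coop.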

(Member) commented:

Do we really need 4 different name components on that bucket? Is there a non-archives bucket that we need to differentiate it from? Do we foresee having non-PUDL usage metrics that would need to be stored somewhere else? Or could we put all usage metrics data under usage-metrics.catalyst.coop?

@bendnorman (Member, Author) commented:

I think we probably don't need the pudl part because the bucket lives in the pudl-catalyst-cooperative GCP project. It would be nice to keep archive because I could imagine having a bucket to store the outputs of our usage metric ETL to parquet files if we ever move to Big Query. How about usage-metrics-archive.catalyst.coop?

(Member) commented:

@bendnorman Would love to coordinate a name change with the flight of PRs in pudl-usage-metrics as much as possible, so just let me know what you're thinking.

@bendnorman (Member, Author) commented:

I think we'll keep it as is for now but I created an issue to rename it down the line.

@bendnorman (Member, Author) commented:

Also created an issue for moving the pudl-s3-logs.catalyst.coop bucket.

@e-belfer (Member) commented Sep 17, 2024:

@bendnorman A question from @jdangerx on this usage-metrics PR I just merged that is probably more relevant here: catalyst-cooperative/pudl-usage-metrics#167 (comment)

(TL;DR: do we need to terraform the account credentials set up on the daily raw logs archiver?)

@jdangerx (Member) commented:

I think, if the name of the bucket is still under discussion & we're considering moving buckets, then we should just do that as a separate PR and get this merged in.

@bendnorman added this pull request to the merge queue Sep 17, 2024
Merged via the queue into main with commit 410708e Sep 17, 2024
17 checks passed
@bendnorman deleted the add-pudl-usage-metrics-gcp-infra branch September 17, 2024 19:53