
Add pudl usage metrics gcp infrastructure #3841

Merged
bendnorman merged 4 commits into main from add-pudl-usage-metrics-gcp-infra on Sep 17, 2024

Conversation

@bendnorman (Member) commented Sep 11, 2024

Overview

  • Mount the Cloud SQL pudl-usage-metrics-db to the Superset Cloud Run instance
  • Create a new bucket for raw usage metrics archives
  • Give the GitHub Action service account permission to write to the bucket

Testing

The usage metrics data can be viewed in Superset, and the GitHub metrics are being saved to the new bucket.

To-do list

- Mount the pudl-usage-metrics db to the Superset Cloud Run service
- Create a new bucket for raw usage metrics archives
- Give the GitHub Action service account permission to write to the bucket (rough Terraform sketch of all three below)
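For reference, a rough Terraform sketch of what these three items could look like. The resource names, locations, container image, and the Cloud SQL instance reference below are assumptions for illustration, not the exact values in this PR.

# Hypothetical sketch -- names, locations, and references are assumptions, not the PR's exact config.

# Bucket for raw usage metrics archives.
resource "google_storage_bucket" "pudl_usage_metrics_archives" {
  name     = "pudl-usage-metrics-archives.catalyst.coop"
  location = "US" # assumed location
}

# Let the GitHub Action service account (defined later in this PR) write objects into the bucket.
resource "google_storage_bucket_iam_member" "usage_metrics_archiver_writer" {
  bucket = google_storage_bucket.pudl_usage_metrics_archives.name
  role   = "roles/storage.objectCreator"
  member = "serviceAccount:${google_service_account.usage_metrics_archiver.email}"
}

# Attach the Cloud SQL pudl-usage-metrics-db to the Superset Cloud Run service via
# the run.googleapis.com/cloudsql-instances annotation.
resource "google_cloud_run_service" "superset" {
  name     = "superset"     # assumed service name
  location = "us-central1"  # assumed region

  template {
    metadata {
      annotations = {
        "run.googleapis.com/cloudsql-instances" = google_sql_database_instance.pudl_usage_metrics_db.connection_name # assumed instance reference
      }
    }
    spec {
      containers {
        image = "gcr.io/example-project/superset:latest" # placeholder image
      }
    }
  }
}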
Comment on lines +400 to +405
resource "google_secret_manager_secret" "pudl_usage_metrics_db_connection_string" {
secret_id = "pudl-usage-metrics-db-connection-string"
replication {
auto {}
}
}
@bendnorman (Member, Author) commented:

I figured I'd save the connection string in case we need to reconnect the db to superset.
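For context on how the secret actually gets a value: the resource above only creates the secret container, so a version holding the connection string has to be added separately. A minimal sketch, assuming the string is passed in as a sensitive Terraform variable rather than hard-coded:

# Hypothetical: store the actual connection string as a secret version.
variable "pudl_usage_metrics_db_connection_string" {
  type      = string
  sensitive = true
}

resource "google_secret_manager_secret_version" "pudl_usage_metrics_db_connection_string" {
  secret      = google_secret_manager_secret.pudl_usage_metrics_db_connection_string.id
  secret_data = var.pudl_usage_metrics_db_connection_string # e.g. "postgresql://user:password@host:5432/metrics"
}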

Comment on lines +415 to +418
resource "google_service_account" "usage_metrics_archiver" {
account_id = "usage-metrics-archiver"
display_name = "PUDL usage metrics archiver github action service account"
}
@bendnorman (Member, Author) commented:

I did create a service account key for the GitHub action in the business repo. @jdangerx would love a WIF tutorial soon!

@jdangerx (Member) commented:

I've forgotten everything I know about WIF but could re-learn it!
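Not part of this PR, but since WIF came up: a rough sketch of what keyless GitHub Actions auth via Workload Identity Federation could look like, so the archiver wouldn't need an exported service account key. Pool and provider names are assumptions; the repo in the condition is assumed to be catalyst-cooperative/pudl-usage-metrics.

# Hypothetical Workload Identity Federation setup for GitHub Actions (not in this PR).
resource "google_iam_workload_identity_pool" "github_actions" {
  workload_identity_pool_id = "github-actions-pool"
}

resource "google_iam_workload_identity_pool_provider" "github_actions" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github_actions.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-actions-provider"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "attribute.repository == \"catalyst-cooperative/pudl-usage-metrics\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let workflows from that repo impersonate the archiver service account instead of using a key.
resource "google_service_account_iam_member" "usage_metrics_archiver_wif" {
  service_account_id = google_service_account.usage_metrics_archiver.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github_actions.name}/attribute.repository/catalyst-cooperative/pudl-usage-metrics"
}

The workflow side would then authenticate with google-github-actions/auth using the provider's resource name and the service account email instead of a JSON key.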

@bendnorman added the cloud and superset labels on Sep 11, 2024
resource "google_storage_bucket_iam_member" "usage_metrics_etl_gcs_iam" {
  for_each = toset(["roles/storage.legacyBucketReader", "roles/storage.objectViewer"])
@bendnorman (Member, Author) commented:

I couldn't find a non-legacy role that gives a principal the storage.buckets.get and storage.objects.get permissions. It seems like the GCS Python client wants both to access objects in a bucket.

@jdangerx (Member) commented:

This is probably fine. If we want to switch to non-legacy roles, it looks like we could give roles/storage.objectUser for objects.get and the confusingly named roles/storage.insightsCollectorService gets you buckets.get.
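If we ever do drop the legacy role, a sketch of what that swap might look like using the roles suggested above; the bucket and member references are assumptions, since they aren't shown in this snippet.

# Hypothetical non-legacy alternative to the legacy roles above.
resource "google_storage_bucket_iam_member" "usage_metrics_etl_gcs_iam" {
  for_each = toset(["roles/storage.objectUser", "roles/storage.insightsCollectorService"])

  bucket = google_storage_bucket.pudl_usage_metrics_archives.name                 # assumed bucket reference
  role   = each.key
  member = "serviceAccount:${google_service_account.usage_metrics_etl.email}"     # assumed ETL service account reference
}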

@jdangerx (Member) left a comment:

This all looks fine, just wondering if we should manage the s3-logs bucket via terraform as well. Other than this you're good to go!


resource "google_storage_bucket_iam_member" "usage_metrics_etl_s3_logs_gcs_iam" {
for_each = toset(["roles/storage.legacyBucketReader", "roles/storage.objectViewer"])

bucket = "pudl-s3-logs.catalyst.coop"
@jdangerx (Member) commented:

Should we manage this bucket via TF too?

@bendnorman (Member, Author) commented Sep 13, 2024:

Probably! How can we manage a resource in terraform that has already been created in the UI? Also, we should probably move the contents of pudl-s3-logs.catalyst.coop to pudl-usage-metrics-archives.catalyst.coop/s3/ for consistency.

(Member) commented:

Here's the documentation about handling "resource drift"!
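For the mechanics: since Terraform 1.5 an import block can adopt an existing, UI-created resource into state without recreating it (the older route is the terraform import CLI command). A minimal sketch, assuming the bucket keeps its current name and the resource block is written to match what already exists:

# Hypothetical: adopt the UI-created bucket into Terraform state (Terraform >= 1.5).
import {
  to = google_storage_bucket.pudl_s3_logs
  id = "pudl-s3-logs.catalyst.coop"
}

resource "google_storage_bucket" "pudl_s3_logs" {
  name     = "pudl-s3-logs.catalyst.coop"
  location = "US" # assumption; must match the existing bucket or Terraform will plan changes
}

Running terraform plan after adding this should show the bucket being imported rather than created; the CLI equivalent is terraform import google_storage_bucket.pudl_s3_logs pudl-s3-logs.catalyst.coop.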

(Member) commented:

Do we really need 4 different name components on that bucket? Is there a non-archives bucket that we need to differentiate it from? Do we foresee having non-PUDL usage metrics that would need to be stored somewhere else? Or could we put all usage metrics data under usage-metrics.catalyst.coop?

@bendnorman (Member, Author) commented:

I think we probably don't need the pudl part because the bucket lives in the pudl-catalyst-cooperative GCP project. It would be nice to keep archive because I could imagine having a bucket to store the outputs of our usage metric ETL to parquet files if we ever move to Big Query. How about usage-metrics-archive.catalyst.coop?

(Member) commented:

@bendnorman Would love to coordinate a name change with the flight of PRs in pudl-usage-metrics as much as possible, so just let me know what you're thinking.

@bendnorman (Member, Author) commented:

I think we'll keep it as is for now but I created an issue to rename it down the line.

@bendnorman (Member, Author) commented:

Also created an issue for moving the pudl-s3-logs.catalyst.coop bucket.

@e-belfer (Member) commented Sep 17, 2024:

@bendnorman A question from @jdangerx on this usage-metrics PR I just merged that is probably more relevant here: catalyst-cooperative/pudl-usage-metrics#167 (comment)

(TL;DR: do we need to terraform the account credentials set up on the daily raw logs archiver?)

@jdangerx (Member) commented:

I think, if the name of the bucket is still under discussion & we're considering moving buckets, then we should just do that as a separate PR and get this merged in.

@bendnorman added this pull request to the merge queue Sep 17, 2024
Merged via the queue into main with commit 410708e Sep 17, 2024
17 checks passed
@bendnorman deleted the add-pudl-usage-metrics-gcp-infra branch September 17, 2024 19:53