Download only recently added or changed forecaster files #233

Draft · wants to merge 10 commits into dev
Conversation

@nmdefries (Collaborator) commented Apr 6, 2022

Description

Instead of recreating predictions_cards from scratch every time the pipeline is run, download and process only files that have been added or modified since the last pipeline run. Added and modified files are selected based on commit history pulled from the Reich Lab repository via the GitHub REST API.
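For illustration, a minimal sketch of how the changed-file list might be pulled via the GitHub REST API with httr; the repository path, the hard-coded `since` date, and the lack of pagination handling are simplifying assumptions, not the exact code in this PR.

# Sketch only: list commits touching the forecast folder since the last run,
# then collect the files added or modified by each commit.
library(httr)

api_base <- "https://api.github.com/repos/reichlab/covid19-forecast-hub"
auth <- add_headers(Authorization = paste("token", Sys.getenv("GITHUB_TOKEN")))

commits <- content(GET(
  paste0(api_base, "/commits"), auth,
  query = list(path = "data-processed", since = "2022-03-30T00:00:00Z")
))

# Each commit's detail response includes the list of files it touched
changed_files <- unique(unlist(lapply(commits, function(commit) {
  detail <- content(GET(paste0(api_base, "/commits/", commit$sha), auth))
  vapply(detail$files, function(f) f$filename, character(1))
})))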

The new files are joined onto the predictions card object from the previous run, which is downloaded from the S3 bucket, and deduplicated in case any past predictions were modified.
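And a sketch of the merge-and-deduplicate step; the aws.s3 helper, bucket name, and key columns are assumptions for illustration, with `new_cards` standing in for the freshly processed files.

# Sketch only: append new rows to the previous predictions_cards and keep the
# most recent version of any prediction that was modified.
library(dplyr)
library(aws.s3)

previous_cards <- s3readRDS("predictions_cards.rds", bucket = "forecast-eval")

predictions_cards <- bind_rows(new_cards, previous_cards) %>%
  # new_cards comes first, so keeping the first occurrence per key retains
  # the updated value when a past prediction was revised
  distinct(forecaster, forecast_date, target_end_date, geo_value, quantile,
           .keep_all = TRUE)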

The pipeline maintains the ability to regenerate the entire submission history via a manual override. The script can take two command-line arguments, exhaustive-download and exhaustive-scoring (the latter does not currently change scoring behavior). Defaults are set in the Makefile.
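A possible shape for that flag handling (the flag names come from the description above; the parsing itself is only a sketch, not necessarily what create_reports.R does):

# Sketch only: default to the incremental behavior unless a flag is passed.
args <- commandArgs(trailingOnly = TRUE)
exhaustive_download <- "exhaustive-download" %in% args
exhaustive_scoring  <- "exhaustive-scoring" %in% args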

Addresses this task.

Bonus: refactor of Report/create_reports.R.

Changes

  • Makefile
  • s3_upload_ec2.yml (self-hosted workflow)
  • Report/create_reports.R
  • Report/fetch_data.R (new)
  • Report/process_data.R (new)

@nmdefries (Collaborator Author) commented:

@krivard Please take a look at this when you have a chance for initial/broad feedback.

API authentication works in a test workflow run. I believe output from this branch matches output from the current pipeline, but that comparison was a while ago, so I'll probably repeat it.

We'd previously discussed setting up a manual option to be able to have the pipeline download the full repo history (in case we need to regenerate predictions_cards.rds for some reason).

I'm not sure of the best way to do that. Currently the pipeline doesn't have a params file; the only option it takes is a (static) command-line arg. I can set up a couple of extra arguments -- I was thinking exhaustive-downloads and exhaustive-scoring -- but I'm not sure how to get those into the make targets. Is there a good option?

# is added that backfills forecast dates, we will end up requesting all those
# dates for forecasters we've already seen before. To prevent that, make a new
# call to `get_covidhub_predictions` for each forecaster with its own dates.
predictions_cards <- lapply(
@nmdefries (Collaborator Author) commented on the snippet above:
Could easily parallelize this.
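For example, a rough sketch with the base `parallel` package; `forecasters` and `dates_by_forecaster` are placeholders for whatever the surrounding code iterates over, and `mclapply` forks, so it would need adjusting on Windows.

# Sketch only: run the per-forecaster downloads in parallel instead of serially.
library(parallel)

predictions_cards <- mclapply(
  forecasters,
  function(forecaster) {
    get_covidhub_predictions(forecaster,
                             forecast_dates = dates_by_forecaster[[forecaster]])
  },
  mc.cores = max(1L, detectCores() - 1L)
)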

@krivard (Contributor) left a comment:

I think this is what you were asking but lmk if I'm confused: what you probably want to do is make two functions:

  • fetch_predictions_cards_updates() - which uses the lapply per-forecaster get_covidhub_predictions call
  • fetch_predictions_cards_all() - which uses the old all-forecasters all-dates get_covidhub_predictions call

then use an environment variable to figure out which one to use when setting predictions_cards.
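A minimal sketch of that dispatch, assuming an EXHAUSTIVE_DOWNLOAD environment variable (matching the Makefile example in the next comment) and the two function names above:

# Sketch only: pick the fetch path based on an environment variable.
exhaustive <- tolower(Sys.getenv("EXHAUSTIVE_DOWNLOAD", "no")) %in% c("yes", "true", "1")

predictions_cards <- if (exhaustive) {
  fetch_predictions_cards_all()       # old all-forecasters, all-dates call
} else {
  fetch_predictions_cards_updates()   # per-forecaster calls for changed dates only
}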

I'd also recommend pulling all the github parsing stuff out into a different file if you can; there's a lot there.

@krivard (Contributor) commented Apr 12, 2022

& for invoking the different behaviors, you have options:

  1. a separate make target for each configuration
  2. a single make target with variables that can be set via the shell environment
  3. a single make target with variables that are looked up in a Makefile.in import file

e.g. for (2):

# defaults; override using make -e
EXHAUSTIVE_DOWNLOAD:=no
EXHAUSTIVE_SCORING:=no
[...]
score_forecast: r_build dist pull_data
	docker run --rm \
		-v ${PWD}/Report:/var/forecast-eval \
		-v ${PWD}/dist:/var/dist \
		-w /var/forecast-eval \
		-e GITHUB_TOKEN=${GITHUB_TOKEN} \
		-e EXHAUSTIVE_DOWNLOAD=${EXHAUSTIVE_DOWNLOAD} \
		-e EXHAUSTIVE_SCORING=${EXHAUSTIVE_SCORING} \
		forecast-eval-build \
		Rscript create_reports.R --dir /var/dist

& then invoke any of:

$ EXHAUSTIVE_DOWNLOAD="yes" EXHAUSTIVE_SCORING="yes" make -e score_forecast
$ EXHAUSTIVE_DOWNLOAD="yes" make -e score_forecast
$ EXHAUSTIVE_SCORING="yes" make -e score_forecast
$ make score_forecast
