Download only recently added or changed forecaster files #233

Draft · wants to merge 10 commits into dev
Conversation

@nmdefries (Collaborator) commented Apr 6, 2022

Description

Instead of recreating predictions_cards from scratch every time the pipeline is run, download and process only files that have been added or modified since the last pipeline run. Added and modified files are selected based on commit history pulled from the Reich Lab repository via the GitHub REST API.
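For illustration, a minimal sketch of how the changed-file list might be pulled via the GitHub REST API with httr; the repository path, the hard-coded `since` date, and the lack of pagination handling are simplifying assumptions, not the exact code in this PR.

# Sketch only: list commits touching the forecast folder since the last run,
# then collect the files added or modified by each commit.
library(httr)

api_base <- "https://api.github.com/repos/reichlab/covid19-forecast-hub"
auth <- add_headers(Authorization = paste("token", Sys.getenv("GITHUB_TOKEN")))

commits <- content(GET(
  paste0(api_base, "/commits"), auth,
  query = list(path = "data-processed", since = "2022-03-30T00:00:00Z")
))

# Each commit's detail response includes the list of files it touched
changed_files <- unique(unlist(lapply(commits, function(commit) {
  detail <- content(GET(paste0(api_base, "/commits/", commit$sha), auth))
  vapply(detail$files, function(f) f$filename, character(1))
})))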

The new files are joined onto the predictions card object from the previous run, which is downloaded from the S3 bucket, and deduplicated in case any past predictions were modified.
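And a sketch of the merge-and-deduplicate step; the aws.s3 helper, bucket name, and key columns are assumptions for illustration, with `new_cards` standing in for the freshly processed files.

# Sketch only: append new rows to the previous predictions_cards and keep the
# most recent version of any prediction that was modified.
library(dplyr)
library(aws.s3)

previous_cards <- s3readRDS("predictions_cards.rds", bucket = "forecast-eval")

predictions_cards <- bind_rows(new_cards, previous_cards) %>%
  # new_cards comes first, so keeping the first occurrence per key retains
  # the updated value when a past prediction was revised
  distinct(forecaster, forecast_date, target_end_date, geo_value, quantile,
           .keep_all = TRUE)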

The pipeline maintains the ability to regenerate the entire submission history via a manual override. The script can take two command-line arguments, exhaustive-download and exhaustive-scoring (the latter does not currently change scoring behavior). Defaults are set in the Makefile.
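A possible shape for that flag handling (the flag names come from the description above; the parsing itself is only a sketch, not necessarily what create_reports.R does):

# Sketch only: default to the incremental behavior unless a flag is passed.
args <- commandArgs(trailingOnly = TRUE)
exhaustive_download <- "exhaustive-download" %in% args
exhaustive_scoring  <- "exhaustive-scoring" %in% args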

Addresses this task.

Bonus: refactor of Report/create_reports.R.

Changes

  • Makefile
  • s3_upload_ec2.yml (self-hosted workflow)
  • Report/create_reports.R
  • Report/fetch_data.R (new)
  • Report/process_data.R (new)

@nmdefries (Collaborator Author) commented:

@krivard Please take a look at this when you have a chance for initial/broad feedback.

API authentication works in a test workflow run. I believe output from this branch matches output from the current pipeline, but that comparison was a while ago, so I'll probably repeat it.

We'd previously discussed setting up a manual option to be able to have the pipeline download the full repo history (in case we need to regenerate predictions_cards.rds for some reason).

I'm not sure of the best way to do that. Currently the pipeline doesn't have a params file; the only option it takes is a (static) command-line arg. I can set up a couple of extra arguments -- I was thinking exhaustive-downloads and exhaustive-scoring -- but I'm not sure how to get those into the make targets. Is there a good option?

# is added that backfills forecast dates, we will end up requesting all those
# dates for forecasters we've already seen before. To prevent that, make a new
# call to `get_covidhub_predictions` for each forecaster with its own dates.
predictions_cards <- lapply(
@nmdefries (Collaborator Author) commented on the snippet above:
Could easily parallelize this.
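For example, a rough sketch with the base `parallel` package; `forecasters` and `dates_by_forecaster` are placeholders for whatever the surrounding code iterates over, and `mclapply` forks, so it would need adjusting on Windows.

# Sketch only: run the per-forecaster downloads in parallel instead of serially.
library(parallel)

predictions_cards <- mclapply(
  forecasters,
  function(forecaster) {
    get_covidhub_predictions(forecaster,
                             forecast_dates = dates_by_forecaster[[forecaster]])
  },
  mc.cores = max(1L, detectCores() - 1L)
)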

@krivard (Contributor) left a comment:

I think this is what you were asking but lmk if I'm confused: what you probably want to do is make two functions:

  • fetch_predictions_cards_updates() - which uses the lapply per-forecaster get_covidhub_predictions call
  • fetch_predictions_cards_all() - which uses the old all-forecasters all-dates get_covidhub_predictions call

then use an environment variable to figure out which one to use when setting predictions_cards.
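A minimal sketch of that dispatch, assuming an EXHAUSTIVE_DOWNLOAD environment variable (matching the Makefile example in the next comment) and the two function names above:

# Sketch only: pick the fetch path based on an environment variable.
exhaustive <- tolower(Sys.getenv("EXHAUSTIVE_DOWNLOAD", "no")) %in% c("yes", "true", "1")

predictions_cards <- if (exhaustive) {
  fetch_predictions_cards_all()       # old all-forecasters, all-dates call
} else {
  fetch_predictions_cards_updates()   # per-forecaster calls for changed dates only
}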

I'd also recommend pulling all the github parsing stuff out into a different file if you can; there's a lot there.

@krivard (Contributor) commented Apr 12, 2022

& for invoking the different behaviors, you have options:

  1. a separate make target for each configuration
  2. a single make target with variables that can be set via the shell environment
  3. a single make target with variables that are looked up in a Makefile.in import file

e.g. for (2):

# defaults; override using make -e
EXHAUSTIVE_DOWNLOAD:=no
EXHAUSTIVE_SCORING:=no
[...]
score_forecast: r_build dist pull_data
	docker run --rm \
		-v ${PWD}/Report:/var/forecast-eval \
		-v ${PWD}/dist:/var/dist \
		-w /var/forecast-eval \
		-e GITHUB_TOKEN=${GITHUB_TOKEN} \
		-e EXHAUSTIVE_DOWNLOAD=${EXHAUSTIVE_DOWNLOAD} \
		-e EXHAUSTIVE_SCORING=${EXHAUSTIVE_SCORING} \
		forecast-eval-build \
		Rscript create_reports.R --dir /var/dist

& then invoke any of:

$ EXHAUSTIVE_DOWNLOAD="yes" EXHAUSTIVE_SCORING="yes" make -e score_forecast
$ EXHAUSTIVE_DOWNLOAD="yes" make -e score_forecast
$ EXHAUSTIVE_SCORING="yes" make -e score_forecast
$ make score_forecast
