Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GISAID covCLI bugfixes & BioSample/SRA modifications #64

Merged

Conversation

erikwolfsohn
Copy link
Contributor

@erikwolfsohn erikwolfsohn commented Aug 15, 2024

Edit: added some BioSample/SRA workflow changes and attached my config.yaml and metadata files in case you wanted to test against them.
example_config.yaml.txt
example_metadata.csv

Hey Dakota! Thanks for getting this new release out. The new handling for BioSample packages is amazing. I hugely appreciate how much this project simplifies metadata validation across multiple pathogens/packages/repositories and how convenient it makes large volume submissions.

I've tested the BioSample/SRA, and GISAID covCLI workflows - the NCBI workflows worked perfectly, but I ran into a bunch of submission failures testing covCLI.

I made some modifications and now my GISAID submissions are going through reliably. I haven't had time to do any serious testing so I can't say if all these changes will hold up, but I wanted to go ahead and submit a pull request in case there's anything that might be helpful.

The former description of this pull request is a little out of date now. I've been testing GISAID covCLI, SRA, and BioSample heavily, using the SARS-CoV-2 and OneHealth Enteric BioSample packages. Below are changes addressing workflow errors/submission failures I encountered during testing. I want to revisit a few of the changes I made, but hopefully some of them are useful!

⚙️ General

  • Remove code blocking prod submissions
  • Make fasta file an optional input

🛠️ covCLI updates/bugfixes

  • Modify several interactions w/the metadata sheet column headers
  • Adjust logfile parsing to properly capture the sample name and EPI ID from the submission log
  • Address issues causing some functions to exit prematurely when passed empty inputs (when only GISAID isolates or GISAID segments are present)
  • Prevent crashing when submission includes samples already registered in GISAID - warn the user & log virus name and EPI ID
    Edit: this is working as far as preventing crashes, but isn't reliably logging the EPI ID of the repeated samples

📋 BioSample & SRA updates/bugfixes

  • Drop empty bs- & sra- columns before metadata validation. Pandera will ignore optional columns if they don't exist - if they exist but have no data pandera will attempt to validate them and the workflow will fail. If a mandatory column has been left empty by the user and is removed by this change, pandera will still catch it.
  • If bs-description is absent, create it by joining values from organism, bs-host, bs-host_disease. bs-description isn't in the Submission Wizard template and is not required by NCBI for submission, but the workflow will fail when it isn't present.
    Edit: I want to revisit this, because it doesn't seem like missing bs-description causes a crash w/every BioSample package, and also this is kind of a bandaid anyway.
  • Warn the user if performing BioSample and SRA submissions together with Link_Sample_Between_NCBI_Databases disabled. SRA submission is likely to fail in this situation unless the user has added BioSample accessions manually beforehand.
  • Prevent workflow from failing if bs-description is missing when user is not submitting to BioSample.
  • When a biosample/sra column has mixture of data & blank rows, don't write the blank attributes to submission.xml. This causes the affected sample to fail, & there are a couple of valid reasons to have mixed columns:
    • submitting multiple pathogens under the same biosample template w/different optional fields completed
    • submitting mix of paired and single end reads to SRA

Testing data was generated via:

python seqsender.py test_data --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/

Metadata and config templates were created with the Shiny app Submission Wizard

And the workflow was run with this command:

python seqsender.py submit --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/ --submission_name COV_TEST_DATA --config_file test_data/CCPHL/cov_ccphl_config.yaml --metadata_file test_data/CCPHL/meta2.csv --fasta_file test_data/CCPHL/sequence.fasta --test

… & output file parsing. Added some additional error handling to prevent submission failures.
optional biosample/sra columns are empty
bs-description is not present
@erikwolfsohn erikwolfsohn changed the title GISAID covCLI modifications/bugfixes GISAID covCLI bugfixes & BioSample/SRA modifications Aug 17, 2024
…ly created when biosample is being submitted, fixed capitalization
… biosample or SRA attributes. this is applicable when:

-biosample: submitting multiple pathogens under same biosample template w/different optional fields
-SRA: submitting mix of single & paired end reads
@dthoward96 dthoward96 changed the base branch from master to v1.2.1.-Bug-Fixes-Update September 4, 2024 18:16
@dthoward96 dthoward96 merged commit 971bfba into CDCgov:v1.2.1.-Bug-Fixes-Update Sep 6, 2024
@dthoward96
Copy link
Collaborator

dthoward96 commented Sep 11, 2024

Hey @erikwolfsohn,

Thanks for all the changes you've contributed! I've incorporated them into the updates I was already working on for the v1.2.1 update and have also expanded upon some of the additions you made as well.

I expanded the try/catch for file permission errors to cover all files being generated by the file_handler.py script. I really liked your changes for GISAID but I made a couple of modifications to eliminate the database prefix issues and changed how the logging occurs as I noticed there could be an error where the incorrect sample name might be used. The bs-description field should be resolved as I also changed how the config file works for the description title and comment. I've split and renamed the "bs-description" field into "bs-sample_title" and "bs-sample_description" with updated documentation in the templates so that should provide a better explanation of those fields. I also provided some additional info in the shiny issue you had created as well.

Thanks, again,
-Dakota

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants