
Adding imputed sex labels #2207

Open · wants to merge 160 commits into base: dev
Conversation

@erflynn erflynn commented Mar 25, 2020

Issue Number

This addresses issue number #2181.

Purpose/Implementation Notes

This update includes imputed sex labels for microarray data (mouse, rat, and human).

I will update with more labels soon; I am not sure whether you want to integrate this pull request now or come back to it later when I have more labels. I am sending it now so we can work through methods, format, etc.

Methods

Details are included below (this is also in the README within the config/externally_supplied_metadata/ directory).
All code to produce this update is included in the sl_label repository and can be easily applied to other organisms that have XX/XY sex determination (provided sufficient metadata sex labels are available for model training).

The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the sex breakdown of many studies. We trained a penalized logistic regression model that uses the expression of X and Y chromosome genes to impute the sex of a given sample. The model was trained using the glmnet R package with the elastic net penalty; lambda was selected by ten-fold cross-validation. We used metadata sex labels as "ground truth" for the training and testing data. To construct the training and testing datasets, we filtered for samples with metadata sex labels that did not have a cell line annotation (using the refine.bio sex and cell_line tags), and grouped these samples into studies. We divided the set of all studies in half, and then within each half sampled n=700 samples for training and n=300 samples for testing, for both males and females. This provided balanced training and testing datasets, where none of the samples in the test data came from a study that had been seen in training.
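As a rough illustration of the training setup described above (study-level hold-out, elastic net penalty, penalty strength chosen by ten-fold cross-validation), here is a minimal sketch. It uses scikit-learn's elastic-net logistic regression as a stand-in for the glmnet R package actually used; all data, study IDs, and variable names are synthetic and illustrative, and the per-sex n=700/300 sampling is omitted for brevity:

```python
# Hypothetical sketch of the described workflow, NOT the code from
# erflynn/sl_label. Rows are samples; columns stand in for X/Y
# chromosome gene expression values.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)

n, p = 200, 20
X = rng.normal(size=(n, p))
# Synthetic "sex" label driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
study = rng.integers(0, 10, size=n)  # synthetic study ID per sample

# Hold out whole studies so no test sample shares a study with training.
test_studies = {0, 1, 2}
train = ~np.isin(study, list(test_studies))

# Elastic-net penalty; regularization strength chosen by 10-fold CV,
# analogous to selecting lambda with cv.glmnet in R.
model = LogisticRegressionCV(
    cv=10, penalty="elasticnet", solver="saga", l1_ratios=[0.5], max_iter=5000
)
model.fit(X[train], y[train])

acc = model.score(X[~train], y[~train])  # held-out accuracy
```

The key design point mirrored here is that the train/test split happens at the study level, not the sample level, so reported test accuracy is not inflated by within-study correlation.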

A previous study [1] indicated that there is widespread mis-annotation in metadata sex labels; because of this, we set up additional testing datasets: a high-confidence mixed sex dataset and single sex datasets. The high-confidence mixed sex dataset consists of all mixed sex studies with at least five male and five female samples where the metadata sex labels match expression-based labels from at least one of two clustering-based sex imputation methods [1,2] (we did not apply clustering-based methods to the entire refine.bio dataset because they have poor performance on small studies, single sex studies, and studies with high class imbalance). The single sex datasets are all studies with at least ten samples and all-male or all-female metadata sex labels.

Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model on various subsets of the data, comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis [3] (94.2%) (see Table 2).

| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
| ----- | ---- | ---- | ---- | ---- |
| human | 430119 | 74.90% | 14987 | 87.30% |
| mouse | 228707 | 74.00% | 12995 | 80.90% |
| rat | 31361 | 55.50% | 1295 | 65.30% |

Table 1. Metadata missingness for sex labels.

| dataset | Human | Mouse | Rat |
| ----- | ---- | ---- | ---- |
| Training (n=1400) | 95.60% | 96.10% | 98.60% |
| Testing (n=600) | 95.20% | 95.70% | 95.50% |
| Metadata | 93.5% (107748) | 93.5% (58473) | 94.8% (13995) |
| High-confidence | 97.3% (7301) | 95.5% (3968) | n/a |
| Single sex - f | 96.5% (12919) | 93.4% (13243) | 96.3% (2240) |
| Single sex - m | 92.6% (8128) | 93.6% (30225) | 95.8% (12689) |
| Manual annotations | 94.2% (8289) | n/a | n/a |

Table 2. Concordance of sex labels. Numbers in parentheses indicate the total number of samples; percentages are the number of samples that agree divided by that total. High-confidence labels have matching metadata and clustering-based expression labels.

The cleaned metadata sex labels are also included in the cleaned_metadata/ directory for microarray and RNA-seq. This process mapped all harmonized sex labels to "male", "female", "mixed", or "unknown". Code for this is included in 01_metadata within the erflynn/sl_label repository.

Types of changes

This includes externally supplied metadata files in .json format. These are under
config/externally_supplied_metadata/. The format is as follows (and as discussed in Issue #2127):

{
  "sample_accession": "<SAMPLE_ACCESSION_CODE>",
  "attributes": [
    {
      "PATO:0000047": {
        "value": "<VALUE>",
        "probability": <PROBABILITY>
      }
    }
  ]
}

where value is one of "PATO:0000383" (female) or "PATO:0000384" (male) and probability is P(imputed_sex=value) from the logistic regression model.
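As a hypothetical illustration of the record format above, a helper like the following could build one such entry (the helper and the sample accession are invented for this sketch and are not part of the PR; the field names and PATO terms come from the format described above):

```python
import json

# Illustrative only: PATO:0000047 is the biological sex attribute, and
# PATO:0000383/PATO:0000384 are the female/male values, per the text.
def make_sex_record(sample_accession, imputed_sex_term, probability):
    assert imputed_sex_term in {"PATO:0000383", "PATO:0000384"}
    assert 0.0 <= probability <= 1.0
    return {
        "sample_accession": sample_accession,
        "attributes": [
            {"PATO:0000047": {"value": imputed_sex_term,
                              "probability": probability}}
        ],
    }

# Hypothetical accession; probability is P(imputed_sex=value) from the model.
record = make_sex_record("GSM1234567", "PATO:0000384", 0.97)
serialized = json.dumps(record)
```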

Cleaned harmonized metadata for sex is included in the cleaned_metadata/ directory with the columns "acc" (sample accession), "sex" (the harmonized sex label), and "mapped_sex" (the harmonized sex label mapped to "male", "female", "mixed", or "unknown").
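To sketch what the cleaning step could look like: the actual mapping rules live in the 01_metadata code of erflynn/sl_label, so the raw-label variants and accessions below are illustrative examples only, not the full rule set:

```python
# Illustrative sketch of mapping harmonized sex labels to
# "male"/"female"/"mixed"/"unknown"; not the actual 01_metadata code.
def map_sex_label(raw):
    s = (raw or "").strip().lower()
    if s in {"male", "m"}:
        return "male"
    if s in {"female", "f"}:
        return "female"
    if s in {"mixed", "mixed sex", "pooled male and female"}:
        return "mixed"
    return "unknown"

# Rows mirror the cleaned_metadata/ columns: acc, sex, mapped_sex.
rows = [("GSM0000001", "F"), ("GSM0000002", "male"), ("GSM0000003", "")]
cleaned = [{"acc": acc, "sex": sex, "mapped_sex": map_sex_label(sex)}
           for acc, sex in rows]
```

Anything not covered by an explicit rule falls through to "unknown", which keeps the output column closed over the four values listed above.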

Screenshot

Sex sample breakdown using metadata and (imputed) expression labels for each organism.
sex_breakdown_microarray.pdf

References

[1] Toker, L., et al. F1000Research. 2016, 5: 2103.
[2] Buckberry, S., et al. Bioinformatics. 2014, 30(14): 2084–2085.
[3] Giles, C. B., et al. BMC Bioinformatics. 2017, 18(Suppl 14): 509.

arielsvn and others added 30 commits July 3, 2019 12:04
[hotfix] Bring master up to date with dev and redeploy to bring the API up
Deploy small fixes for the foreman.
Improve our ability to not retry jobs that shouldn't be retried.
[HOTFIX] Deploy compendium fixes to production
[HOTFIX] Deploy config tuning for prod
[HOTFIX] Deploy a couple minor changes to prod
Brings master up to date with dev, deploying many changes including some tuning
[HOTFIX] Fix migration to run in SQL instead of python
[HOTFIX] Disable stats background scheduler
[HOTFIX] Deploy foreman and background scheduler fixes
Bring master up to date with dev to deploy improvements and fixes
[HOTFIX] Bumps max_clients to 8, don't increase RAM when instance cycling.
[HOTFIX] Deploy fix for SRA surveyor
Deploy surveyor jobs running on smasher
[HOTFIX] Deploy new feed-the-beast, up salmon timeout
[HOTFIX] Merge missing commit from the feed the beast branch
[HOTFIX] Deploy fixes to the beast feeder
[HOTFIX] Improves the way we manage the queue of downloader jobs.
Bumps up max clients to scale up a little more
[HOTFIX] Deploy old salmon version rerunning! (and others)
[HOTFIX] Don't scale up number of volumes
[HOTFIX] Deploy salmon rerun fix and scale up larger
[HOTFIX] Bumps max clients up to 14
[HOTFIX] Deploy migration that was left out before
kurtwheeler and others added 11 commits February 25, 2020 13:29
…revert-2151-dev

Revert "Revert "[DEPLOY] Fix for microbes and GSE75083""
…revert-2155-dev

Revert "Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues""
[DEPLOY] Finally get pgbouncer traffic routing working correctly
[DEPLOY] Deploy final fixes to pg_bouncer config and fix intermittent test failures.
[HOTFIX] [DEPLOY] Patch long query in smasher jobs
[DEPLOY] Trigger a deploy of quite a lot of stuff
[DEPLOY] Bump volume size for smasher instance
Member

@jaclyn-taroni jaclyn-taroni left a comment


Hi @erflynn, thanks so much for filing this! I'm Jaclyn, I'm a scientist at the CCDL. 👋 @kurtwheeler asked me to take a look at the methods section of your pull request.

In general, these look good to me! I think the level of detail you have included in the README looks good, but I had a few comments about linking to specific parts of the repo you mentioned (erflynn/sl_label) or previous analyses in case folks want to dig in a bit more at a later date. In addition, I had a question about the content of the CSV files included here – they may be as expected, but it was not intuitive to me in the context of the README, which might indicate that this is an area where we should expand the documentation a bit.

Thanks again! Please let us know if you need anything!

config/externally_supplied_metadata/README.md (outdated review thread, resolved)

| dataset | Human | Mouse | Rat |
| ----- | ---- | ---- | ---- |
| Training (n=1400) | 95.60% | 96.10% | 98.60% |
Member


I suspect there might be more information about the training and testing set in the https://github.com/erflynn/sl_label repository about things like class balance. Can we add a link to that information in the paragraph above that begins with The majority of gene expression data is missing metadata sex labels please?

Author


Added some information! thanks for the suggestion


This update includes imputed sex labels for microarray data (mouse, rat, and human).

The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the breakdown by sex of many studies. We used the expression of X and Y chromosome genes and metadata sex labels to train a logistic regression model (with elastic net penalty) to predict sample sex. Across all three organisms, our models achieve approximately 95% accuracy in a randomly selected held-out test set as compared to the metadata labels. Additionally, we assessed the accuracy of our model, on various subsets of the data; comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis (94.2%) (see Table 2).
Member


Is there a link to a previous analysis that you would be comfortable putting here?

Author


oops! missing citation here, added in. apologies!


| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
| ----- | ---- | ---- | ---- | ---- |
| human | 430119 | 74.90% | 14987 | 87.30% |
Member


I was noticing the line counts for the CSV files - I would expect for human_rnaseq + human_microarray not to exceed the number Samples (n) in this table. I took a closer look at the tabular data in the CSV files and it looks like there are duplicate values in the acc column and in the microarray file, there are run accessions that are consistent with RNA-seq data (e.g., DRR, ERR). There are about 99k accessions shared across the human RNA-seq and microarray files but looking at the human RNA-seq file there are no GEO sample accessions (e.g., GSM). This may totally be intentional, but it is a bit different than what I would expect given the context in this document.

Author


Re the csv files - you are right, there are duplicates, apologies here! I have updated these, and put in a commit that fixes that. The numbers now exactly match those in the table.

For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same. Is there a different way that you usually organize this?

Member


For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same.

Ah okay I see! Then I would expect some overlap because the normalized compendium contains both microarray and RNA-seq data. So maybe it would be better to call these files normalized and rnaseq_sample rather than microarray and rnaseq to match the terminology here: https://www.refine.bio/compendia.

Author


ah ok! I will update this, thank you!

config/externally_supplied_metadata/README.md (outdated review thread, resolved)
@erflynn
Author

erflynn commented Apr 1, 2020

Thanks for the feedback @jaclyn-taroni. This is very helpful!

I updated the README to include more description, and fixed the duplicates within the .csv files. I am not sure what the coverage difference is between the compendia, but the samples should match the aggregated_metadata.json files. I included a link to the code for this.

I also have the RNA-seq sex labels for all of the files that I could find (Issue #2211), and will amend the pull request to include these.

For linking to the erflynn/sl_label repo - perhaps there is a better way I can organize linking to this? I'm trying to reorganize the repo to make it clearer as to what I did.

@jaclyn-taroni
Member

Hi @erflynn - thanks for updating your comment to include all of this info!

I have given this question a bit of thought:

For linking to the erflynn/sl_label repo - perhaps there is a better way I can organize linking to this? I'm trying to reorganize the repo to make it clearer as to what I did.

I think we definitely want to make sure all of the information you've now added to your comment makes it into a version-controlled Markdown document. Now the question becomes where, because if you have to write things up in two places that's more difficult to maintain. Some other considerations:

  1. I expect that erflynn/sl_label will always be "ahead" of AlexsLemonade/refinebio in regards to what's in config/externally_supplied_metadata/ just based on how I expect development will go.
  2. Sometimes (okay, often...nearly always in my personal experience) it's helpful to have someone who has not done an analysis review the documentation around the analysis because they have some distance.

There are two ways to address this that come to mind for me.

We can either have all of the documentation that's relevant to a particular "release" of labels (the CSV and JSON files in AlexsLemonade:dev) in config/externally_supplied_metadata/README.md. Someone from our team would review that documentation each time you file a pull request. In that documentation, you can use permalinks to link to the state of the code in erflynn/sl_label that was current as of filing the PR to AlexsLemonade/refinebio.

Another approach would be to keep all of the documentation in erflynn/sl_label alongside the code and use permalinks to the documentation that was current as of filing a PR to AlexsLemonade/refinebio in config/externally_supplied_metadata/README.md. A downside of this approach is that the documentation review step is not built-in, but one way to approach that would be to add someone from our team as a collaborator and only file PRs for documentation over on erflynn/sl_label.

Option 1 is probably more straightforward in my opinion, but would like to hear your thoughts. We can talk specifics of organizing the linking once we decide on a general strategy. Please let me know if anything is unclear! Thanks!

Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
@erflynn
Author

erflynn commented Apr 7, 2020

Hi @jaclyn-taroni,

Thanks for thinking this through!

I agree that erflynn/sl_label will be ahead. I also agree it is helpful to have someone review the analysis.

I think option 1 makes more sense; I would prefer to have the documentation review step built in.

One thought would be whether it is ok if the config/externally_supplied_metadata/README.md contains permalinks to other documentation in erflynn/sl_label, e.g. a link to documentation about how to run the code in erflynn/sl_label as well?

I would like to add instructions for how to run the code in the erflynn/sl_label repo to create the version of the output that is added in a pull request, but I think the nitty gritty of that (rather than the high-level detailed description, which can go in the config/externally_supplied_metadata/README.md) could go in the documentation in the erflynn/sl_label repository, and be linked from the README.md.

Thanks again,
Emily

@jaclyn-taroni
Member

Hi @erflynn -

One thought would be whether it is ok if the config/externally_supplied_metadata/README.md contains permalinks to other documentation in erflynn/sl_label, e.g. a link to documentation about how to run the code in erflynn/sl_label as well?

I would like to add instructions for how to run the code in the erflynn/sl_label repo to create the version of the output that is added in a pull request, but I think the nitty gritty of that (rather than the high-level detailed description, which can go in the config/externally_supplied_metadata/README.md) could go in the documentation in the erflynn/sl_label repository, and be linked from the README.md.

This sounds perfect to me! My biggest concern would be tying it to documentation in erflynn/sl_label that is in sync with what's in this repo, but the permalink solution you've laid out covers that. Thank you!

@erflynn
Author

erflynn commented Apr 10, 2020

Awesome - I will get this set up, and send an updated PR.

Thanks for the feedback!

@cgreene
Contributor

cgreene commented Aug 3, 2020

Hi @erflynn - we're starting to adjust how our metadata are stored so that we can import this. How are things going on your end?

@erflynn
Author

erflynn commented Aug 3, 2020

Hi @cgreene - thanks for reaching out. That's great news!

Things are going well. I was wrapping up another project, but am back to working on this manuscript.

One note about the metadata labels: I noticed last week that the harmonized metadata sex labels are somewhat lossy in how they're parsed (e.g. https://www.ebi.ac.uk/ena/data/view/SRS1752937&display=xml has sex information, but it does not make it into the harmonized data). I've found this in SRA data; I haven't looked into compendia data yet, but I expect it might be similar. I'm working on fixing this right now because I need a more accurate assessment of metadata coverage. I can send an update or post an issue later this week with the expanded set of labels and the parsing code that I write.

@kurtwheeler
Contributor

Hi @erflynn! We've added some metadata from MetaSRA to the config/externally_supplied_metadata directory. At some point before you merge this, would you mind moving all your files down one level into config/externally_supplied_metadata/erflynn?

@erflynn
Author

erflynn commented Aug 14, 2020

not a problem! will do. and I'll add some updated labels

6 participants