Adding imputed sex labels #2207

Open: wants to merge 160 commits into base: dev
Changes from 153 commits
bb9f5fb
Merge pull request #1367 from AlexsLemonade/dev
arielsvn Jul 3, 2019
e861290
Merge pull request #1383 from AlexsLemonade/dev
arielsvn Jul 12, 2019
8c2b3c6
Merge pull request #1425 from AlexsLemonade/dev
kurtwheeler Jul 30, 2019
d561189
Merge pull request #1430 from AlexsLemonade/dev
kurtwheeler Jul 30, 2019
2affb92
Merge pull request #1434 from AlexsLemonade/dev
kurtwheeler Jul 31, 2019
e58d12e
Merge pull request #1439 from AlexsLemonade/dev
kurtwheeler Jul 31, 2019
0958459
Merge pull request #1444 from AlexsLemonade/dev
kurtwheeler Aug 1, 2019
161f639
Merge pull request #1449 from AlexsLemonade/dev
kurtwheeler Aug 2, 2019
77b3b3d
Merge pull request #1451 from AlexsLemonade/dev
kurtwheeler Aug 5, 2019
165e5b5
Merge pull request #1460 from AlexsLemonade/dev
kurtwheeler Aug 8, 2019
ab826cb
Merge pull request #1464 from AlexsLemonade/dev
kurtwheeler Aug 8, 2019
aec7673
Merge pull request #1467 from AlexsLemonade/dev
kurtwheeler Aug 8, 2019
a312fb9
Merge pull request #1471 from AlexsLemonade/dev
kurtwheeler Aug 9, 2019
759fa61
Merge pull request #1475 from AlexsLemonade/dev
kurtwheeler Aug 9, 2019
5de1a74
Merge pull request #1484 from AlexsLemonade/dev
kurtwheeler Aug 13, 2019
54980c0
Merge pull request #1489 from AlexsLemonade/dev
kurtwheeler Aug 14, 2019
eb86446
Merge pull request #1493 from AlexsLemonade/dev
kurtwheeler Aug 14, 2019
a37f174
Merge pull request #1495 from AlexsLemonade/dev
kurtwheeler Aug 15, 2019
72137d1
Merge pull request #1498 from AlexsLemonade/dev
kurtwheeler Aug 15, 2019
885c0c1
Merge pull request #1502 from AlexsLemonade/dev
kurtwheeler Aug 16, 2019
9dfb476
Merge pull request #1504 from AlexsLemonade/dev
kurtwheeler Aug 16, 2019
d65521e
Merge pull request #1506 from AlexsLemonade/dev
kurtwheeler Aug 17, 2019
53d18db
Merge pull request #1509 from AlexsLemonade/dev
kurtwheeler Aug 17, 2019
a790dbc
Merge pull request #1511 from AlexsLemonade/dev
kurtwheeler Aug 18, 2019
f10da1e
Merge pull request #1520 from AlexsLemonade/dev
kurtwheeler Aug 21, 2019
807c9cd
Merge pull request #1522 from AlexsLemonade/dev
kurtwheeler Aug 21, 2019
6211230
Merge pull request #1525 from AlexsLemonade/dev
kurtwheeler Aug 22, 2019
d74c4b1
Merge pull request #1527 from AlexsLemonade/dev
kurtwheeler Aug 22, 2019
93208ce
Merge pull request #1529 from AlexsLemonade/dev
kurtwheeler Aug 22, 2019
4f06ef7
Merge pull request #1531 from AlexsLemonade/dev
kurtwheeler Aug 23, 2019
aaf820b
Merge pull request #1533 from AlexsLemonade/dev
kurtwheeler Aug 23, 2019
ddd21b4
Merge pull request #1540 from AlexsLemonade/dev
kurtwheeler Aug 27, 2019
d8055e9
Merge pull request #1545 from AlexsLemonade/dev
kurtwheeler Aug 29, 2019
9d8904d
Merge pull request #1550 from AlexsLemonade/dev
kurtwheeler Aug 29, 2019
9ba208a
Merge pull request #1553 from AlexsLemonade/dev
kurtwheeler Aug 29, 2019
8957478
Merge pull request #1574 from AlexsLemonade/dev
kurtwheeler Sep 9, 2019
671ee50
Merge pull request #1577 from AlexsLemonade/dev
kurtwheeler Sep 9, 2019
a082baa
Merge pull request #1579 from AlexsLemonade/dev
kurtwheeler Sep 10, 2019
55c6e95
Merge pull request #1581 from AlexsLemonade/dev
kurtwheeler Sep 10, 2019
0413573
Merge pull request #1584 from AlexsLemonade/dev
kurtwheeler Sep 10, 2019
16f6393
Merge pull request #1586 from AlexsLemonade/dev
kurtwheeler Sep 11, 2019
c970731
Merge pull request #1590 from AlexsLemonade/dev
kurtwheeler Sep 11, 2019
b98d1af
Merge pull request #1593 from AlexsLemonade/dev
kurtwheeler Sep 12, 2019
2281d98
Merge pull request #1598 from AlexsLemonade/dev
kurtwheeler Sep 13, 2019
0529808
Merge pull request #1600 from AlexsLemonade/dev
kurtwheeler Sep 13, 2019
ad3cdcc
Merge pull request #1613 from AlexsLemonade/dev
kurtwheeler Sep 17, 2019
fd3ed97
Merge pull request #1623 from AlexsLemonade/dev
kurtwheeler Sep 19, 2019
147ba64
Merge pull request #1634 from AlexsLemonade/dev
kurtwheeler Sep 20, 2019
58a60c7
Merge pull request #1636 from AlexsLemonade/dev
kurtwheeler Sep 20, 2019
6ef82e6
Merge pull request #1641 from AlexsLemonade/dev
arielsvn Sep 23, 2019
2d5c959
Merge pull request #1645 from AlexsLemonade/dev
kurtwheeler Sep 24, 2019
246fe50
Merge pull request #1648 from AlexsLemonade/dev
kurtwheeler Sep 25, 2019
54d5557
Merge pull request #1651 from AlexsLemonade/dev
kurtwheeler Sep 25, 2019
0332147
Merge pull request #1653 from AlexsLemonade/dev
kurtwheeler Sep 25, 2019
d01df50
Merge pull request #1655 from AlexsLemonade/dev
kurtwheeler Sep 25, 2019
767ef18
Merge pull request #1659 from AlexsLemonade/dev
kurtwheeler Sep 25, 2019
1c4fad2
Merge pull request #1663 from AlexsLemonade/dev
kurtwheeler Sep 26, 2019
0a307a8
Merge pull request #1668 from AlexsLemonade/dev
kurtwheeler Sep 26, 2019
cc7ffa6
Merge pull request #1674 from AlexsLemonade/dev
kurtwheeler Sep 26, 2019
92741af
Merge pull request #1677 from AlexsLemonade/dev
kurtwheeler Sep 26, 2019
a17ba68
Merge pull request #1681 from AlexsLemonade/dev
kurtwheeler Sep 27, 2019
7855f90
Merge pull request #1685 from AlexsLemonade/dev
kurtwheeler Sep 27, 2019
5e8f03f
Merge pull request #1702 from AlexsLemonade/dev
arielsvn Sep 30, 2019
19e91b3
Merge pull request #1709 from AlexsLemonade/dev
arielsvn Oct 1, 2019
5f553ed
Merge pull request #1711 from AlexsLemonade/dev
arielsvn Oct 1, 2019
056544a
Merge pull request #1713 from AlexsLemonade/dev
kurtwheeler Oct 2, 2019
5682449
Merge pull request #1715 from AlexsLemonade/dev
kurtwheeler Oct 2, 2019
e52617f
Merge pull request #1719 from AlexsLemonade/dev
kurtwheeler Oct 2, 2019
53bc12e
Merge pull request #1721 from AlexsLemonade/dev
kurtwheeler Oct 2, 2019
f81c3ae
Merge pull request #1724 from AlexsLemonade/dev
kurtwheeler Oct 3, 2019
c515f13
Merge pull request #1729 from AlexsLemonade/dev
kurtwheeler Oct 3, 2019
e223e09
Merge pull request #1746 from AlexsLemonade/dev
kurtwheeler Oct 4, 2019
ee4e1ce
Merge pull request #1755 from AlexsLemonade/dev
kurtwheeler Oct 9, 2019
318ca72
Merge pull request #1761 from AlexsLemonade/dev
kurtwheeler Oct 10, 2019
0a849e1
Merge pull request #1763 from AlexsLemonade/dev
kurtwheeler Oct 11, 2019
edfe1f0
Merge pull request #1765 from AlexsLemonade/dev
kurtwheeler Oct 11, 2019
49c196c
Merge pull request #1767 from AlexsLemonade/dev
kurtwheeler Oct 11, 2019
08afa7c
Merge pull request #1769 from AlexsLemonade/dev
kurtwheeler Oct 11, 2019
505766b
Merge pull request #1777 from AlexsLemonade/dev
kurtwheeler Oct 14, 2019
c66c83c
Merge pull request #1779 from AlexsLemonade/dev
kurtwheeler Oct 15, 2019
bb66080
Merge pull request #1785 from AlexsLemonade/dev
kurtwheeler Oct 15, 2019
f1bc015
Merge pull request #1789 from AlexsLemonade/dev
kurtwheeler Oct 16, 2019
305b1a5
Merge pull request #1791 from AlexsLemonade/dev
kurtwheeler Oct 16, 2019
64463d4
Merge pull request #1793 from AlexsLemonade/dev
kurtwheeler Oct 16, 2019
f281016
Merge pull request #1798 from AlexsLemonade/dev
kurtwheeler Oct 16, 2019
78c2014
Merge pull request #1802 from AlexsLemonade/dev
kurtwheeler Oct 17, 2019
2acbb3d
Merge pull request #1805 from AlexsLemonade/dev
kurtwheeler Oct 18, 2019
78ce24a
Merge pull request #1807 from AlexsLemonade/dev
kurtwheeler Oct 19, 2019
ce97534
Merge pull request #1810 from AlexsLemonade/dev
kurtwheeler Oct 19, 2019
bd535a2
Merge pull request #1815 from AlexsLemonade/dev
davidsmejia Oct 21, 2019
9928600
Merge pull request #1821 from AlexsLemonade/dev
kurtwheeler Oct 22, 2019
e627316
Merge pull request #1827 from AlexsLemonade/dev
kurtwheeler Oct 23, 2019
9aa1f6f
Merge pull request #1834 from AlexsLemonade/dev
kurtwheeler Oct 25, 2019
493b995
Merge pull request #1845 from AlexsLemonade/dev
arielsvn Oct 30, 2019
0981e13
Merge pull request #1862 from AlexsLemonade/dev
arielsvn Nov 4, 2019
dc09bd7
Merge pull request #1864 from AlexsLemonade/dev
arielsvn Nov 4, 2019
738d32d
Merge pull request #1884 from AlexsLemonade/dev
kurtwheeler Nov 11, 2019
ed68fe1
Merge pull request #1886 from AlexsLemonade/dev
kurtwheeler Nov 11, 2019
7d29bef
Merge pull request #1892 from AlexsLemonade/dev
kurtwheeler Nov 12, 2019
a7f1270
Merge pull request #1894 from AlexsLemonade/dev
kurtwheeler Nov 12, 2019
720d77a
Merge pull request #1896 from AlexsLemonade/dev
kurtwheeler Nov 12, 2019
32fffb1
Merge pull request #1912 from AlexsLemonade/dev
kurtwheeler Nov 15, 2019
a2d3d79
Merge pull request #1919 from AlexsLemonade/dev
kurtwheeler Nov 18, 2019
792d678
Merge pull request #1921 from AlexsLemonade/dev
kurtwheeler Nov 18, 2019
5186ca9
Merge pull request #1926 from AlexsLemonade/dev
arielsvn Nov 19, 2019
a620266
Merge pull request #1932 from AlexsLemonade/dev
kurtwheeler Nov 20, 2019
d8c31c3
Merge pull request #1936 from AlexsLemonade/dev
kurtwheeler Nov 21, 2019
c248bae
Merge pull request #1938 from AlexsLemonade/dev
kurtwheeler Nov 22, 2019
e664ccd
Merge pull request #1946 from AlexsLemonade/dev
kurtwheeler Nov 25, 2019
ca31a67
Merge pull request #1948 from AlexsLemonade/dev
kurtwheeler Nov 25, 2019
d752dae
Merge pull request #1956 from AlexsLemonade/dev
kurtwheeler Nov 26, 2019
1331d89
Merge pull request #1960 from AlexsLemonade/dev
kurtwheeler Nov 27, 2019
3890a59
Merge pull request #1961 from AlexsLemonade/dev
kurtwheeler Nov 27, 2019
413e350
Merge pull request #1964 from AlexsLemonade/dev
davidsmejia Nov 28, 2019
24e10e0
Merge pull request #1966 from AlexsLemonade/dev
kurtwheeler Dec 2, 2019
f51d41f
Merge pull request #1971 from AlexsLemonade/dev
kurtwheeler Dec 4, 2019
0293aaa
Merge pull request #1974 from AlexsLemonade/dev
kurtwheeler Dec 4, 2019
2b8ff5f
Merge pull request #1983 from AlexsLemonade/dev
kurtwheeler Dec 5, 2019
0358e0b
Merge pull request #1985 from AlexsLemonade/dev
kurtwheeler Dec 6, 2019
ad67001
Merge pull request #1987 from AlexsLemonade/dev
arielsvn Dec 12, 2019
adc0278
Merge pull request #1990 from AlexsLemonade/dev
arielsvn Dec 12, 2019
71e9c8a
Merge pull request #1994 from AlexsLemonade/dev
arielsvn Dec 12, 2019
21a14da
Merge pull request #1999 from AlexsLemonade/dev
arielsvn Dec 13, 2019
d40c168
Merge pull request #2002 from AlexsLemonade/dev
arielsvn Dec 16, 2019
7783f66
Merge pull request #2016 from AlexsLemonade/dev
davidsmejia Dec 26, 2019
601f8bd
Merge pull request #2028 from AlexsLemonade/dev
arielsvn Dec 27, 2019
9515bc0
Merge pull request #2039 from AlexsLemonade/dev
arielsvn Jan 3, 2020
21947f1
Merge pull request #2063 from AlexsLemonade/dev
kurtwheeler Jan 10, 2020
ded5a81
Merge pull request #2093 from AlexsLemonade/dev
arielsvn Jan 23, 2020
80926bb
Merge pull request #2099 from AlexsLemonade/dev
kurtwheeler Jan 27, 2020
c1c0d5f
Merge pull request #2102 from AlexsLemonade/dev
kurtwheeler Jan 27, 2020
eb39bc9
Merge pull request #2107 from AlexsLemonade/dev
kurtwheeler Jan 28, 2020
0b1642a
Merge pull request #2113 from AlexsLemonade/dev
kurtwheeler Jan 30, 2020
bc808a0
Merge pull request #2116 from AlexsLemonade/dev
kurtwheeler Jan 30, 2020
f807442
Merge pull request #2118 from AlexsLemonade/dev
kurtwheeler Jan 31, 2020
7717ffb
Merge pull request #2126 from AlexsLemonade/dev
arielsvn Feb 3, 2020
28ca229
Merge pull request #2151 from AlexsLemonade/dev
kurtwheeler Feb 24, 2020
1eca4f6
Merge pull request #2155 from AlexsLemonade/dev
kurtwheeler Feb 24, 2020
7702218
Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues"
kurtwheeler Feb 25, 2020
0ff36ad
Merge pull request #2156 from AlexsLemonade/revert-2155-dev
kurtwheeler Feb 25, 2020
80e1e44
Revert "[DEPLOY] Fix for microbes and GSE75083"
kurtwheeler Feb 25, 2020
c51ca1c
Merge pull request #2157 from AlexsLemonade/revert-2151-dev
kurtwheeler Feb 25, 2020
66748ef
Revert "Revert "[DEPLOY] Fix for microbes and GSE75083""
kurtwheeler Feb 25, 2020
4d7f834
Merge pull request #2161 from AlexsLemonade/revert-2157-revert-2151-dev
kurtwheeler Feb 25, 2020
2557c8f
Revert "Revert "[DEPLOY] Fix pgbouncer/RDS remaining issues""
kurtwheeler Feb 25, 2020
bcd5c38
Merge pull request #2162 from AlexsLemonade/revert-2156-revert-2155-dev
kurtwheeler Feb 25, 2020
b47923f
Merge pull request #2163 from AlexsLemonade/dev
kurtwheeler Feb 25, 2020
a8476ae
Merge pull request #2166 from AlexsLemonade/dev
kurtwheeler Feb 26, 2020
28c98de
Merge pull request #2169 from AlexsLemonade/dev
kurtwheeler Feb 27, 2020
1235c1a
Merge pull request #2201 from AlexsLemonade/dev
kurtwheeler Mar 24, 2020
da9a344
Merge pull request #2203 from AlexsLemonade/dev
kurtwheeler Mar 24, 2020
e8a8ad3
added imputed sex microarray files
erflynn Mar 20, 2020
e43bd22
added readme and cleaned metadata
erflynn Mar 25, 2020
d6140ac
Update config/externally_supplied_metadata/README.md
erflynn Apr 1, 2020
732cab5
Update config/externally_supplied_metadata/README.md
erflynn Apr 1, 2020
7adb9ce
fixed duplicate issue with cleaned metadata files
erflynn Apr 1, 2020
35702dc
Merge branch 'dev' of https://github.com/erflynn/refinebio into dev
erflynn Apr 1, 2020
715ab52
added RNA-seq sex labels
erflynn Apr 1, 2020
d9145d3
updated microarray to normalized compendia to match refine-bio termin…
erflynn Apr 1, 2020
4ead408
Update config/externally_supplied_metadata/README.md
erflynn Apr 7, 2020
32 changes: 32 additions & 0 deletions config/externally_supplied_metadata/README.md
@@ -0,0 +1,32 @@
## Externally supplied metadata
#### E Flynn
#### Last updated 3/25/2020

All code to produce this update is included in the [sl_label repository](https://github.com/erflynn/sl_label).
erflynn marked this conversation as resolved.

This update includes imputed sex labels for microarray data (mouse, rat, and human).

The majority of gene expression data is missing metadata sex labels (see Table 1). This lack of labels prevents us from examining the breakdown by sex of many studies. We used the expression of X and Y chromosome genes and metadata sex labels to train a logistic regression model (with elastic net penalty) to predict sample sex. Across all three organisms, our models achieve approximately 95% accuracy on a randomly selected held-out test set when compared to the metadata labels. Additionally, we assessed the accuracy of our models on various subsets of the data by comparing to all metadata sex labels (agreement 93.5-94.8%), a random sample of single-sex studies (agreement 92.6-96.5%), and, in human, manually annotated sex labels from a previous analysis (agreement 94.2%) (see Table 2).
Member: Is there a link to a previous analysis that you would be comfortable putting here?

Author: Oops! Missing citation here, added in. Apologies!
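
The README paragraph above describes a logistic regression model with an elastic net penalty trained on X and Y chromosome gene expression. The following is only a minimal, from-scratch sketch of that kind of model, not the code from the `sl_label` repository. The toy features (one "Y-linked" gene, one "X-linked" gene, one noise gene), the synthetic data, and all function names and hyperparameters are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 (lasso) penalty.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, lam=0.01, alpha=0.5, lr=0.1, epochs=500):
    """Proximal gradient descent for logistic loss plus an elastic net penalty.

    lam scales the whole penalty; alpha mixes L1 (alpha) and L2 (1 - alpha).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(p):
            # Gradient step on the smooth part (log-loss + ridge term) ...
            step = w[j] - lr * (grad_w[j] / n + lam * (1 - alpha) * w[j])
            # ... followed by the proximal step for the lasso term.
            w[j] = soft_threshold(step, lr * lam * alpha)
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


def make_toy_data(n, rng):
    # Synthetic samples: label 1 = male, 0 = female (hypothetical encoding).
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        y_gene = rng.gauss(2.0 if label else -2.0, 0.5)   # higher in males
        x_gene = rng.gauss(-2.0 if label else 2.0, 0.5)   # higher in females
        noise = rng.gauss(0.0, 1.0)                       # uninformative gene
        X.append([y_gene, x_gene, noise])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_toy_data(200, rng)
X_test, y_test = make_toy_data(100, rng)

w, b = fit_elastic_net_logistic(X_train, y_train)
pred = predict(X_test, w, b)
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

In practice an established library implementation would likely be preferable to hand-written optimization; the sketch only shows how the L1 and L2 terms enter the update.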


| organism | Samples (n) | Samples missing metadata annotation | Studies (n) | Studies missing metadata annotation |
| ----- | ---- | ---- | ---- | ---- |
| human | 430119 | 74.90% | 14987 | 87.30% |
| mouse | 228707 | 74.00% | 12995 | 80.90% |
| rat | 31361 | 55.50% | 1295 | 65.30% |

Table 1. Metadata missingness for sex labels.

Member (on the human row of Table 1): I was noticing the line counts for the CSV files - I would expect for human_rnaseq + human_microarray not to exceed the number Samples (n) in this table. I took a closer look at the tabular data in the CSV files and it looks like there are duplicate values in the `acc` column and in the microarray file, there are run accessions that are consistent with RNA-seq data (e.g., DRR, ERR). There are about 99k accessions shared across the human RNA-seq and microarray files but looking at the human RNA-seq file there are no GEO sample accessions (e.g., GSM). This may totally be intentional, but it is a bit different than what I would expect given the context in this document.

Author: Re the csv files - you are right, there are duplicates, apologies here! I have updated these, and put in a commit that fixes that. The numbers now exactly match those in the table.

For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same. Is there a different way that you usually organize this?

Member:

> For the run accessions, these are exactly the set of accessions that are included in the aggregated_metadata.json files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same.

Ah okay I see! Then I would expect some overlap because the normalized compendium contains both microarray and RNA-seq data. So maybe it would be better to call these files normalized and rnaseq_sample rather than microarray and rnaseq to match the terminology here: https://www.refine.bio/compendia.

Author: Ah ok! I will update this, thank you!
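
The review of Table 1 turned on two checks: duplicate values in the `acc` column of a label file, and accessions shared between two label files. A minimal sketch of both checks is below. Only the `acc` column name comes from the review discussion; the `sex` column, the in-memory stand-ins for the CSV files, and the accession values are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_accessions(handle, column="acc"):
    # Collect the accession column from a CSV file-like object.
    return [row[column] for row in csv.DictReader(handle)]


def duplicated(accessions):
    # Accessions that appear more than once, sorted for stable output.
    counts = Counter(accessions)
    return sorted(acc for acc, count in counts.items() if count > 1)


# Hypothetical stand-ins for the normalized and RNA-seq label files.
normalized_csv = io.StringIO(
    "acc,sex\nGSM0001,female\nGSM0002,male\nERR0001,male\nGSM0002,male\n"
)
rnaseq_csv = io.StringIO("acc,sex\nERR0001,male\nDRR0001,female\n")

normalized = read_accessions(normalized_csv)
rnaseq = read_accessions(rnaseq_csv)

dups = duplicated(normalized)
shared = sorted(set(normalized) & set(rnaseq))

print("duplicates in normalized file:", dups)
print("shared across files:", shared)
print("unique accessions in normalized file:", len(set(normalized)))
```

Comparing the count of unique accessions against the Samples (n) column is one way to confirm that a file's row count is consistent with the table.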


| dataset | Human | Mouse | Rat |
| ----- | ---- | ---- | ---- |
| Training (n=1400) | 95.60% | 96.10% | 98.60% |
| Testing (n=600) | 95.20% | 95.70% | 95.50% |
| Metadata | 93.5% (107748) | 93.5% (58473) | 94.8% (13995) |
| High-confidence | 97.3% (7301) | 95.5% (3968) | n/a |
| Single sex - f | 96.5% (12919) | 93.4% (13243) | 96.3% (2240) |
| Single sex - m | 92.6% (8128) | 93.6% (30225) | 95.8% (12689) |
| Manual annotations | 94.2% (8289) | n/a | n/a |

Table 2. Concordance of sex labels. Numbers in parentheses indicate the total number of samples; percentages are the number of samples that agree divided by the total number of samples. High-confidence labels have matching metadata and clustering-based expression labels.

Member (on the Training row of Table 2): I suspect there might be more information about the training and testing set in https://github.com/erflynn/sl_label about things like class balance. Can we add a link to that information in the paragraph above that begins with "The majority of gene expression data is missing metadata sex labels", please?

Author: Added some information! Thanks for the suggestion.
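
The concordance figures in Table 2 are defined as the number of agreeing samples divided by the total number of samples compared. As a small illustration of that arithmetic (not the analysis code itself), the sketch below compares imputed labels with reference labels for the samples that have both. The function name, the dictionary layout, and the sample IDs are hypothetical.

```python
def concordance(imputed, reference):
    """Return (fraction of agreeing samples, number of samples compared).

    Only samples present in both mappings are compared.
    """
    shared = [acc for acc in reference if acc in imputed]
    if not shared:
        raise ValueError("no samples in common")
    agree = sum(imputed[acc] == reference[acc] for acc in shared)
    return agree / len(shared), len(shared)


# Toy labels keyed by made-up sample IDs.
imputed = {"S1": "female", "S2": "male", "S3": "male", "S4": "female", "S5": "male"}
metadata = {"S1": "female", "S2": "male", "S3": "female", "S4": "female"}

rate, n = concordance(imputed, metadata)
print(f"{rate:.1%} ({n})")  # 75.0% (4)
```

The printed form mirrors the table cells: a percentage followed by the number of samples in parentheses.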


The cleaned metadata sex labels are also included in the `cleaned_metadata/` directory for microarray and RNA-seq. This process mapped all harmonized sex labels to "male", "female", "mixed", or "unknown". Code for this is included in the `sl_label` repository under `code/01_metadata/`.
erflynn marked this conversation as resolved.
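
The cleaning step described above maps every harmonized sex label to "male", "female", "mixed", or "unknown". The actual rules live in the `sl_label` repository under `code/01_metadata/`; the sketch below only illustrates one plausible way to do such a mapping. The synonym sets and the function name are assumptions, not the repository's logic.

```python
import re

# Hypothetical synonym sets; the real cleaning rules may differ.
MALE_TOKENS = {"male", "m", "man", "boy"}
FEMALE_TOKENS = {"female", "f", "woman", "girl"}


def clean_sex_label(raw):
    """Map a free-text label to 'male', 'female', 'mixed', or 'unknown'."""
    if raw is None:
        return "unknown"
    # Split on anything that is not a letter so "M/F" becomes ["m", "f"].
    tokens = [t for t in re.split(r"[^a-z]+", raw.lower()) if t]
    has_male = any(t in MALE_TOKENS for t in tokens)
    has_female = any(t in FEMALE_TOKENS for t in tokens)
    if has_male and has_female:
        return "mixed"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "unknown"


for example in ["Male", "F", "male and female", "not available", None]:
    print(repr(example), "->", clean_sex_label(example))
```

Matching on whole tokens rather than substrings avoids classifying "female" as "male", which a naive substring search would do.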