Adding imputed sex labels #2207
base: dev
Changes from 153 commits
Is there a link to `a previous analysis` that you would be comfortable putting here?
oops! missing citation here, added in. apologies!
I was noticing the line counts for the CSV files - I would expect `human_rnaseq` + `human_microarray` not to exceed the number `Samples (n)` in this table. I took a closer look at the tabular data in the CSV files and it looks like there are duplicate values in the `acc` column, and in the microarray file there are run accessions that are consistent with RNA-seq data (e.g., `DRR`, `ERR`). There are about 99k accessions shared across the human RNA-seq and microarray files, but looking at the human RNA-seq file there are no GEO sample accessions (e.g., `GSM`). This may totally be intentional, but it is a bit different than what I would expect given the context in this document.
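The checks described above (duplicate values in `acc`, accessions shared between the two files, and which accession prefixes appear) can be sketched roughly as follows. This is only an illustration, not the code used in the review: the toy rows, the `sex` column name, and the helper function names are made up for the example; only the `acc` column name and the `GSM`/`DRR`/`ERR` prefixes come from the discussion.

```python
import csv
import io
from collections import Counter


def read_accessions(handle, column="acc"):
    """Return the values of the accession column from an open CSV handle."""
    return [row[column] for row in csv.DictReader(handle)]


def duplicated(accessions):
    """Return the set of accessions that appear more than once."""
    counts = Counter(accessions)
    return {acc for acc, n in counts.items() if n > 1}


def prefix_counts(accessions, length=3):
    """Count accessions by leading characters (e.g. GSM, DRR, ERR)."""
    return Counter(acc[:length] for acc in accessions)


# Toy stand-ins for the two CSV files (made-up accessions and labels).
microarray_csv = io.StringIO(
    "acc,sex\nGSM0001,female\nGSM0002,male\nERR0001,male\nERR0001,male\n"
)
rnaseq_csv = io.StringIO("acc,sex\nERR0001,male\nDRR0001,female\n")

microarray = read_accessions(microarray_csv)
rnaseq = read_accessions(rnaseq_csv)

dups = duplicated(microarray)             # accessions listed more than once
shared = set(microarray) & set(rnaseq)    # accessions present in both files
print(sorted(dups))
print(sorted(shared))
print(prefix_counts(microarray))
print(prefix_counts(rnaseq))
```

With the real files, the handles would come from `open(path, newline="")` instead of `io.StringIO`.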
Re the csv files - you are right, there are duplicates, apologies here! I have updated these, and put in a commit that fixes that. The numbers now exactly match those in the table.
For the run accessions, these are exactly the set of accessions that are included in the `aggregated_metadata.json` files in the respective compendia (the normalized microarray and then the RNA-seq compendia). You are right that there is some overlap -- I will look into this more -- but the accessions are the same. Is there a different way that you usually organize this?
Ah okay I see! Then I would expect some overlap because the normalized compendium contains both microarray and RNA-seq data. So maybe it would be better to call these files `normalized` and `rnaseq_sample` rather than `microarray` and `rnaseq` to match the terminology here: https://www.refine.bio/compendia.
ah ok! I will update this, thank you!
I suspect there might be more information about the training and testing sets, such as class balance, in https://github.com/erflynn/sl_label. Can we add a link to that information in the paragraph above that begins with `The majority of gene expression data is missing metadata sex labels`, please?
Added some information! thanks for the suggestion