Very very slow processing of samples #537
Thank you for the report. Can you share one of the samples where you see this issue? Also, out of curiosity, was the other machine you tried on also OSX, or was it a linux machine? P.S. One other thing that does seem very strange to me is this line:
That is 19.6 billion alignments that were scored and discarded.
Hi Rob,
Thanks for the quick response. The other computer was OSX, should I try a linux machine?
Here are some dropbox links to two of the files. I believe this is the set for the logs I posted.
https://www.dropbox.com/s/8c3p9hmgmadgj89/31_1.trimmed.fastq.gz?dl=0
https://www.dropbox.com/s/2y9jfvaphe9h21x/31_2.trimmed.fastq.gz?dl=0
Thanks,
Ryan
Thanks @shalercr, I'll grab those files once they are finished uploading. I don't know if all samples show similar behavior. Regarding OSX vs. linux, it should not really matter (salmon should work well under both); I'm just curious, since this is obviously atypical and unexpected behavior. --Rob
I processed the data above using the latest salmon from the develop branch (the release candidate for 1.3.0), and I got the following time (used 8 threads, so timing is not directly comparable).
```
4604.57user 43.64system 9:24.15elapsed 823%CPU
```
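For context, the user/system/elapsed figures above can be folded into an effective-CPU count, which is what the 823%CPU field summarizes. A small sketch using the numbers from this run (this is just arithmetic on the printed summary, not part of salmon):

```shell
#!/bin/sh
# Effective CPU usage from the GNU time summary above:
#   4604.57user 43.64system 9:24.15elapsed 823%CPU
# elapsed 9:24.15 is 9*60 + 24.15 = 564.15 seconds.
awk 'BEGIN {
  user = 4604.57; system = 43.64
  elapsed = 9 * 60 + 24.15
  printf "effective CPUs: %.1f\n", (user + system) / elapsed
}'
```

This prints an effective-CPU count of about 8.2, consistent with the reported 823%CPU, i.e. roughly all 8 worker threads kept busy.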
The whole log is
```
[2020-06-15 23:51:54.747] [jointLog] [info] setting maxHashResizeThreads to
8
[2020-06-15 23:51:54.747] [jointLog] [info] Fragment incompatibility prior
below threshold. Incompatible fragments will be ignored.
[2020-06-15 23:51:54.747] [jointLog] [info] Usage of --validateMappings
implies use of minScoreFraction. Since not explicitly specified, it is
being set to 0.65
[2020-06-15 23:51:54.747] [jointLog] [info] Usage of --validateMappings
implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2020-06-15 23:51:54.747] [jointLog] [info] parsing read library format
[2020-06-15 23:51:54.747] [jointLog] [info] There is 1 library.
[2020-06-15 23:51:54.814] [jointLog] [info] Loading pufferfish index
[2020-06-15 23:51:54.814] [jointLog] [info] Loading dense pufferfish index.
[2020-06-15 23:51:55.555] [jointLog] [info] done
[2020-06-15 23:51:55.555] [jointLog] [info] Index contained 116,248 targets
[2020-06-15 23:51:55.588] [jointLog] [info] Number of decoys : 0
[2020-06-16 00:00:59.666] [jointLog] [info] Computed 344,764 rich
equivalence classes for further processing
[2020-06-16 00:00:59.666] [jointLog] [info] Counted 12,956,134 total reads
in the equivalence classes
[2020-06-16 00:00:59.673] [jointLog] [warning] 0.0736383% of fragments were
shorter than the k used to build the index.
If this fraction is too large, consider re-building the index with a
smaller k.
The minimum read size found was 1.
[2020-06-16 00:00:59.673] [jointLog] [info] Number of mappings discarded
because of alignment score : 134,091,887
[2020-06-16 00:00:59.673] [jointLog] [info] Number of fragments entirely
discarded because of alignment score : 2,429,390
[2020-06-16 00:00:59.673] [jointLog] [info] Number of fragments discarded
because they are best-mapped to decoys : 0
[2020-06-16 00:00:59.673] [jointLog] [info] Number of fragments discarded
because they have only dovetail (discordant) mappings to valid targets :
1,360,397
[2020-06-16 00:00:59.673] [jointLog] [info] Mapping rate = 45.4405%
[2020-06-16 00:00:59.673] [jointLog] [info] finished quantifyLibrary()
[2020-06-16 00:00:59.673] [jointLog] [info] Starting optimizer
[2020-06-16 00:00:59.792] [jointLog] [info] Marked 0 weighted equivalence
classes as degenerate
[2020-06-16 00:00:59.819] [jointLog] [info] iteration = 0 | max rel diff. =
8250.92
[2020-06-16 00:00:59.667] [fileLog] [info]
At end of round 0
==================
Observed 28512328 total fragments (28512328 in most recent round)
[2020-06-16 00:01:01.745] [jointLog] [info] iteration = 100 | max rel diff.
= 19.507
[2020-06-16 00:01:03.495] [jointLog] [info] iteration = 200 | max rel diff.
= 2.45489
[2020-06-16 00:01:05.225] [jointLog] [info] iteration = 300 | max rel diff.
= 3.459
[2020-06-16 00:01:06.968] [jointLog] [info] iteration = 400 | max rel diff.
= 4.38485
[2020-06-16 00:01:08.693] [jointLog] [info] iteration = 500 | max rel diff.
= 0.229724
[2020-06-16 00:01:10.426] [jointLog] [info] iteration = 600 | max rel diff.
= 0.242741
[2020-06-16 00:01:12.163] [jointLog] [info] iteration = 700 | max rel diff.
= 0.196958
[2020-06-16 00:01:14.015] [jointLog] [info] iteration = 800 | max rel diff.
= 0.239461
[2020-06-16 00:01:15.803] [jointLog] [info] iteration = 900 | max rel diff.
= 0.197651
[2020-06-16 00:01:17.095] [jointLog] [info] iteration = 969 | max rel diff.
= 0.00714824
[2020-06-16 00:01:17.131] [jointLog] [info] Finished optimizer
[2020-06-16 00:01:17.131] [jointLog] [info] writing output
```
So, you can note that the number of mappings discarded because of alignment score is still quite big (134M) and, indeed, this took longer than I would normally expect for a sample with ~30M reads. However, it is _massively_ smaller (~116 times) than the 15.6B in your run. Thus, it may have been the same issue as in the other GitHub issue. You can try the pre-compiled linux binary I linked there if you have access to a linux machine to test on. By the way, the issue there was a bunch of reads plagued by super-repetitive homopolymer segments; I haven't looked at these reads yet for evidence of that, though.
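As a rough illustration of the homopolymer hypothesis mentioned here (this is not part of salmon; the run-length threshold of 20 and the example reads are made up for the sketch), one could scan reads for long single-nucleotide runs:

```python
from itertools import groupby

def longest_run(seq: str) -> int:
    """Length of the longest homopolymer run in a sequence."""
    return max((len(list(g)) for _, g in groupby(seq)), default=0)

def flag_repetitive(reads, min_run=20):
    """Yield (name, seq) pairs whose longest homopolymer run is >= min_run."""
    for name, seq in reads:
        if longest_run(seq) >= min_run:
            yield name, seq

# Illustrative reads, not taken from the actual samples.
reads = [
    ("ok_read", "ACGTACGTACGTACGT"),
    ("polyA_read", "A" * 40 + "CGTT"),
]
print([name for name, _ in flag_repetitive(reads)])  # ['polyA_read']
```

Running something like this over a FASTQ file would show whether a meaningful fraction of reads is dominated by such runs.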
Hi Rob,
Thanks for the update. I'll see about setting up a Linux box in the morning and trying v1.3.1. I expected some reads to be discarded, as this is a mixed intestinal sample, so there is likely a lot of bacterial RNA (we used rRNA depletion, not polyA selection). We were hoping to align to both the mouse genome and one of the bacterial species of interest.
Given the several orders of magnitude difference in discarded alignments between my run on 1.2.1 and your test run on 1.3.1, would you recommend I redo the whole dataset alignment on 1.3.1? If it runs even close to what you saw, it shouldn't take too long to rerun.
Thanks again,
Ryan
Hi @shalercr, I agree that if each sample runs as quickly as you see here, it's not a big deal to just re-run them all. Note that, while my run avoids ~116 times more (eventually-discarded) alignments than did yours, the difference in the number of discarded fragments (i.e. fragments that had no mapping) is very small (~7,000, or 0.023% of the fragments). Thus, despite the huge speed-up, I wouldn't expect many differences in the quantifications. However, I couldn't argue against re-quantifying everything with the new version just for good measure :). When you have a chance, please do let me know if you see this same performance difference using the pre-compiled (linux) binary. Of course, once we finalize this release (soon), we will push to bioconda so we will also have OSX binaries available. However, I just want to make sure this fixes things on your end as well before the release.
Hey Rob,
I did manage to test v1.3 this evening. It ran much faster: the same sample that took about 6 hours ran in 45 minutes. Still not great, but I think it might be intrinsic to some of these samples; also, I was running it off my laptop, with Linux booted from a flash drive, so not an ideal setup. Either way, much more reasonable. Do you want me to attach any logs or anything?
Best,
Ryan
Hi @shalercr, Thanks for reporting back! I agree that there are some challenging reads in these samples that are likely at the root of the slightly-longer-than-normal runtime. If you could upload the quant dir for this sample (the one that contains the logs), that would be useful. We (specifically, my student @mohsenzakeri, who is one of the main developers of the new selective-alignment algorithm) can poke around a bit to see if there is anything strange going on that can be characterized, but it might just be an inherent property of samples with very repetitive reads. Regardless, we'll be happy to take a look. Thanks! --Rob
P.S. @shalercr, I also note that layering --hitFilterPolicy BOTH on top of the new version cuts down the time by another factor of 2 for me (2163.65user 12.72system 4:21.57elapsed 832%CPU), and the number of mappings discarded due to alignment score comes down by another factor of ~6X. It might be worth seeing what you get with that option as well.
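For anyone following along, layering --hitFilterPolicy BOTH onto a quant run would look something like the sketch below. The index path, output directory, and thread count are placeholders, not taken from the thread; the script only prints the command so it can be inspected before running:

```shell
#!/bin/sh
# Hypothetical salmon quant invocation with --hitFilterPolicy BOTH layered on.
# INDEX, OUT, and -p 8 are placeholder values for illustration.
INDEX=salmon_index
OUT=quants/31
cmd="salmon quant -i $INDEX -l A -1 31_1.trimmed.fastq.gz -2 31_2.trimmed.fastq.gz -p 8 --hitFilterPolicy BOTH -o $OUT"
printf '%s\n' "$cmd"
```

Dropping the printf and running the command directly would perform the actual quantification once the placeholder paths point at real files.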
Thanks for the heads up. I gave it a test this evening and wow, it is wicked fast. I’ll send you those quant files tomorrow when I get a chance, but adding that flag and the new version fixed the problem.
Thank you for all your help.
Best,
Ryan
Hey Rob,
Sorry for the delay, here is a link to the quants folder if you guys are still interested. Everything worked well, especially with the additional flag. Any idea on the timeline for the bioconda release?
Best,
Ryan
https://www.dropbox.com/sh/rmy4f6brxx5iczo/AACxbyZFxN0XGcP3YRGjGO-pa?dl=0
1.3.0 has been released, with these improvements and a few others you can read about in the release notes. Let us know if you have any further issues!
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon
Describe the bug
salmon takes days to process through 16 samples
To Reproduce
Tried running on a different computer, same issue.
Specifically, please provide at least the following information:
Salmon version: 1.2.1
Installed via: bioconda
Reference: mouse Mus_musculus.GRCm38.cdna.all.fa
Input: fastq
Flags: various; the --hitFilterPolicy BOTH flag was just added after reading another user's query (Much slower than previous versions #527) to see if that would help.
Expected behavior
Not to be so slow. I've used salmon previously with some SRR datasets and it was very fast; this seems very strange to me.
Desktop (output of sw_vers):
ProductName: Mac OS X
ProductVersion: 10.15.5
BuildVersion: 19F101