
Step target example broken #983

Closed
gaow opened this issue Jun 15, 2018 · 29 comments


gaow commented Jun 15, 2018

Separated from private correspondence: the RNA-seq DE example workflow is now broken. To reproduce:

cd sos-docs/src/examples
sos run RNASeqDE.ipynb align -j 2 --samples SRR1039508

Error message:

ERROR: Multiple steps hg19_reference_1, hg19_reference_2, hg19_reference_3 to generate target sos_step("hg19_reference")

This used to work, though. I suspect it is similar to the issue in #981, introduced by the recent changes to the target-dependency code.


BoPeng commented Jun 15, 2018

In this example, you have

[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]

and then

depends: sos_step('hg19_reference')

so are you depending on a subworkflow?


gaow commented Jun 15, 2018

Yes. I wonder if this is okay, because I believe it used to work.


BoPeng commented Jun 15, 2018

I never knew it worked, so I would say it was undefined behavior that we need to formalize now. It is like triggering a subworkflow from a dependency. I mean, if you can do

[sub_1]
[sub_2]

[step]
depends: sos_step('sub')

instead of

[sub_1]
[sub_2]

[step]
sos_run('sub')

why wouldn't we allow a full workflow here?

[A_1]
[A_2]
[B_1]

[step]
depends: sos_step('A+B')


gaow commented Jun 15, 2018

Sorry, I'm a bit confused. Is this a proposal? It seems reasonable, if not difficult to implement ... right?


BoPeng commented Jun 15, 2018

I do not know how difficult it is. I am just saying that the behavior was undefined before and triggers an error now. If we are to support it, we are essentially allowing an entire workflow to be triggered, not just a single step; and if we are triggering workflows, sos_step('A+B') makes sense, so the name sos_step might be inappropriate.


gaow commented Jun 15, 2018

I see. I'm okay with the name sos_step though if we decide to support it.

But do we want to support it? It seems a natural extension; but if it is not robust and can cause problems, we would rather not have it. Currently we do not have an immediate need for it -- in the RNA-seq DE example I wanted to demonstrate the feature. Though I could also use file targets, the logic seems more intuitive with a dependency on the mini-workflow. What do you think?


BoPeng commented Jun 15, 2018

In our manuscript, I specifically said something like: the trunk of the workflow can be outcome-based, with process-oriented subworkflows triggered ... So conceptually this is what we want to do, but we do not have an explicit syntax for it.


gaow commented Jun 15, 2018

Indeed ... I guess we should formalize it. I think you've made the behavior clear, but do we need new syntax like workflow_target(), since we have file_target()? Or sos_workflow(), similar to sos_variable()?


BoPeng commented Jun 15, 2018

Yes, sos_workflow() is clearer than sos_step(workflow), but sos_step has caused enough trouble and I do not know if I want to dive into sos_workflow now.

BoPeng pushed a commit that referenced this issue Jun 15, 2018

BoPeng commented Jun 15, 2018

a does not match a_1 etc. now. The loophole was for a_0, as far as I remember, but now a does not even match a_0.


BoPeng commented Jun 15, 2018

In your case

[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]

[default]
depends: sos_step('hg19_reference')

would not work now, but

[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]

[default]
depends: sos_step('hg19_reference_1'), sos_step('hg19_reference_2'), sos_step('hg19_reference_3')

would work. I am wondering if we can allow something like

[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]

[default]
depends: sos_step('hg19_reference_*')

It is conceptually not a workflow, but multiple sos_step targets specified using a wildcard character.
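As an aside, such wildcard expansion of step names could be sketched in a few lines of Python. This is a hypothetical illustration (the function name and step list are made up, not SoS internals), using the standard-library fnmatch module:

```python
import fnmatch
import re

def expand_step_pattern(pattern, step_names):
    """Expand a wildcard step pattern (e.g. 'hg19_reference_*') into the
    matching step names, sorted by their numeric suffix when present."""
    def numeric_suffix(name):
        m = re.search(r'_(\d+)$', name)
        return int(m.group(1)) if m else 0
    return sorted(fnmatch.filter(step_names, pattern), key=numeric_suffix)

steps = ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3', 'default']
print(expand_step_pattern('hg19_reference_*', steps))
# ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3']
```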


gaow commented Jun 15, 2018

I agree with not messing too much with the code at this moment, until there is a real need. For a mini-workflow, one can always do

[C]
sos_run('A+B')

[default]
depends: sos_step('C')

right? Then we could legitimately claim to handle mini-workflows. But I think the _* case is one exception worth special handling, because this is what people will often do -- the most common type of process-oriented subworkflow is one with numeric ordering. So I think it would be good to support sos_step('hg19_reference') directly and, under the hood, expand it to match all steps.


BoPeng commented Jun 15, 2018

There is a minor difference here: we claim the "connectedness" of workflow steps, so in sos_run('a') we know that a_2 inherits output from a_1. With sos_step('a_1'), sos_step('a_2') (or sos_step('a') as you proposed), the steps are not connected and are executed as separate steps.

So in the end there might be a difference between

[C]
sos_run('A')

[default]
depends: sos_step('C')

and

[default]
depends: sos_step('A')

although the latter might just work in the majority of cases.
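The "connectedness" distinction can be modeled in a toy Python sketch (not SoS code; the step functions are hypothetical): in a connected run, each step's default input is the previous step's output, while independent dependencies start from nothing:

```python
def run_connected(steps):
    """Run steps as a workflow: each step's default input is the
    previous step's output (the 'connectedness' sos_run provides)."""
    prev_output = None
    outputs = []
    for step in steps:
        prev_output = step(prev_output)
        outputs.append(prev_output)
    return outputs

def run_independent(steps):
    """Run the same steps as unrelated dependencies: nothing is
    inherited, so every step starts from None."""
    return [step(None) for step in steps]

a_1 = lambda inp: "a_1.out"
a_2 = lambda inp: f"processed({inp})"

print(run_connected([a_1, a_2]))    # ['a_1.out', 'processed(a_1.out)']
print(run_independent([a_1, a_2]))  # ['a_1.out', 'processed(None)']
```

A step like a_2 that only produces fixed output works either way, which is why the unconnected form "might just work in the majority of cases"; a step that reads its inherited input does not.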


gaow commented Jun 15, 2018

Okay, but I guess if sos_step('hg19_reference_*') does not assume connectedness, there might be more confusion than benefit, because my hg19_reference workflow does assume that the steps involved follow numeric ordering ...

@gaow gaow closed this as completed Jun 15, 2018
@gaow gaow reopened this Jun 15, 2018

BoPeng commented Jun 16, 2018

OK, I can ensure order, so sos_step('a') would essentially be sos_workflow('a'), which actually makes the latter unnecessary: sos_workflow('a+b') is rarely needed, and users can use the extra-step trick if necessary.

The problem with the implementation is that we use separate DAGs for sos_run('a'), but in this case the steps would be added to the master DAG, which assumes one numerically ordered, forward-style trunk. Adding multiple numerically ordered mini-workflows would complicate the DAG building (though not necessarily make it difficult).


gaow commented Jun 16, 2018

Hmm, now I guess my nested dependency has made things more complicated:

$ sos run RNASeqDE.ipynb align --samples SRR1039508

ERROR: Defined output fail to produce expected output: /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa generated, /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/genomeParameters.txt expected.

But sos run RNASeqDE.ipynb star now works as expected, thanks to the latest patch. I'll try running that step for now, because I know it can take a few hours. I will test the other steps after we sort this out.


gaow commented Jun 16, 2018

Unfortunately, sos_step('a') does not work because it fails to connect the input/output. Here is an MWE:

[hg_1]
output: "1.txt"
run:
  touch 1.txt

[hg_2]
output: "2.txt"
run: expand = True
   echo {_input[0]} > 2.txt

[default]
depends: sos_step('hg')

and it throws IndexError: list index out of range


gaow commented Jun 16, 2018

Okay, now I get this error:

$ sos run RNASeqDE.ipynb star
INFO: download (index=0) is ignored due to saved signature
INFO: output:   /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit, /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa
INFO: decompress hg19.fa (index=0) is ignored due to saved signature
INFO: output:   /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa
INFO: gene annotations (index=0) is ignored due to saved signature
INFO: output:   /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz
INFO: Target unavailable: sos_step("hg19_reference")
ERROR: Completed target sos_step("hg19_reference") is being re-executed. Please report this bug to SoS developers.


BoPeng commented Jun 16, 2018

Trying to figure out how to trigger this bug with a test case:

[a_1]
output: "a_1"
sh:
  echo whatever > a_1

[a_2]
output: "a_2"
sh: expand=True
  cp {_input} {_output}

[default]
depends: sos_step('a')


BoPeng commented Jun 16, 2018

It was caused by the _ inside hg19_reference. Fixed now.
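For illustration, the kind of parsing bug an underscore inside a workflow name can cause could look like this toy parser (hypothetical code, not the actual SoS fix): splitting a step ID at the first underscore mis-handles names like hg19_reference_1, while splitting only on a trailing numeric index works.

```python
import re

def parse_step_naive(step_id):
    # Buggy: splits at the first underscore, so 'hg19_reference_1'
    # yields workflow 'hg19' instead of 'hg19_reference'.
    name, _, index = step_id.partition('_')
    return name, index

def parse_step_fixed(step_id):
    # Split only on a trailing '_<digits>' index; everything before
    # it, underscores included, is the workflow name.
    m = re.match(r'^(.*)_(\d+)$', step_id)
    if m:
        return m.group(1), int(m.group(2))
    return step_id, None

print(parse_step_naive('hg19_reference_1'))  # ('hg19', 'reference_1')
print(parse_step_fixed('hg19_reference_1'))  # ('hg19_reference', 1)
```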


gaow commented Jun 16, 2018

At this point I was able to test and verify the two commands separately:

sos run RNASeqDE.ipynb star
sos run RNASeqDE.ipynb align

but

sos run RNASeqDE.ipynb align

by itself still fails. See this thread for details: #983 (comment)


BoPeng commented Jun 16, 2018

I see

$ sos run RNASeqDE.ipynb star
INFO: download (index=0) is ignored due to saved signature
INFO: output:   /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit, /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa
INFO: decompress hg19.fa (index=0) is ignored due to saved signature
INFO: output:   /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa
INFO: gene annotations (index=0) is ignored due to saved signature
INFO: output:   /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz
STAR --runMode genomeGenerate \
   --genomeDir /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19 \
   --genomeFastaFiles /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa \
   --sjdbGTFtagExonParentTranscript /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz \
   --runThreadN 4
Jun 16 14:13:27 ..... started STAR run
Jun 16 14:13:27 ... starting to generate Genome files

EXITING because of INPUT ERROR: could not open genomeFastaFile: /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa

Jun 16 14:13:27 ...... FATAL ERROR, exiting
ERROR: [star]: Executing script in docker returns an error (exitcode=104).
The script has been saved to /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_76367.sh. To reproduce the error please run:
docker run --rm   -v /private/tmp:/private/tmp -v /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_76367.sh:/var/lib/sos/docker_run_76367.sh    -t -P -w=/private/tmp -u 1985961928:895809667    gaow/debian-ngs /bin/bash -ev /var/lib/sos/docker_run_76367.sh

I suppose this is because the LargeFiles directory is outside of the current working directory, right? Note that we do not mount home by default now.


gaow commented Jun 16, 2018

@BoPeng Yes, and I have already updated the notebook to reflect the change and added a note. Please pull the latest version.

Currently the star pipeline works by itself, though you are welcome to test it. The issue is that we want that step to be triggered by align, so people only have to type one command. It is designed that way, but it is not working now.


BoPeng commented Jun 16, 2018

On my machine, I got the error for 2G of RAM, then also for 11G, and I am increasing it to 20G. The error message is clear enough, though.

ERROR: [star]: Script killed by docker, probably because of lack of RAM (available RAM=11.2GB, exitcode=137). The script has been saved to /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_82344.sh. To reproduce the error please run:


gaow commented Jun 16, 2018

Yes, I have made a note that it requires 30G+ of memory. My computer has 64G. This example is meant to be reproduced, so I didn't make it an external task. The other RNA-seq example is an external task with a 96G memory requirement and is not meant to be reproduced.


BoPeng commented Jun 16, 2018

Come on, I only ordered 32G of RAM for my new Mac and thought that it would be enough even for virtual machines ... Now I cannot run the star step.


gaow commented Jun 16, 2018

Well, I guess that is why many people want to use TopHat instead. But there is a sparse mode in STAR that will halve the memory usage. I believe I have implemented its interface in this notebook too. I assume users can reserve an interactive cluster node for the full version, though? I will make a note.


gaow commented Jun 16, 2018

BTW, there should be no need to use a VM for this one, because I think most of the tools are dockerized. Maybe it fits if you don't use a VM? My desktop is a $2600 Linux machine with 40 threads and 64G of memory ...


BoPeng commented Jun 16, 2018

$ sos run RNASeqDE.ipynb align --samples SRR1039508

should be fixed.

@gaow gaow closed this as completed Jun 16, 2018