
workflow_template for remote workflow execution #1319

Closed
BoPeng opened this issue Nov 17, 2019 · 20 comments


BoPeng commented Nov 17, 2019

Right now all our task templates look something like

  job_template: |
            #!/bin/bash
            #PBS -N {task}
            #PBS -l nodes={nodes}:ppn={cores}
            #PBS -l walltime={walltime}
            #PBS -l mem={mem//10**9}GB
            #PBS -o ~/.sos/tasks/{task}.out
            #PBS -e ~/.sos/tasks/{task}.err
            sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode} 

but this makes it difficult to specify environments for sos execute. For example, if I would like to run

module load R/3.3

in one case and

module load R/3.4

in another, I would need to create two task queues, or change ~/.sos/hosts between runs.
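
For illustration, the environment line would have to be baked into each queue's template, something like this sketch (a trimmed version of the template above; only the module line would differ between the two queues):

  job_template: |
            #!/bin/bash
            #PBS -N {task}
            module load R/3.3
            sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}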


BoPeng commented Nov 17, 2019

Another option is to allow the specification of templates. Right now the task template is named job_template, but we could

  1. allow the definition of multiple templates and use template=template_name to specify the template to use.
  2. allow template=filename to specify the template to use.

An additional potential benefit is to allow the use of templates for sos run -r. That is to say, if we allow for a template option, we can use

sos run -r host -? template_name_or_file

to execute the workflow with a template that uses the command sos run to run the entire workflow, instead of using sos execute to execute individual tasks.
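
For example, a host with named templates under option 1 might hypothetically look like the sketch below (the templates key and its layout are made up for illustration):

  cluster:
      job_template: |
          ...
      templates:
          r3.3: |
              module load R/3.3
              sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}
          r3.4: |
              module load R/3.4
              sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}

with the template selected via something like task: queue='cluster', template='r3.3'.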


BoPeng commented Nov 17, 2019

I tend to like the second option because the exact command to submit the task (or workflow) is host-dependent; it is better kept in config.yml or mytemplate.tpl and should not be recorded in the notebook. It is often too long to fit nicely in the notebook anyway.


BoPeng commented Nov 17, 2019

Similar to host_template, it could also be useful to execute certain commands after ssh-ing to the host (e.g., module load sos).


gaow commented Nov 17, 2019

Currently my solution to the problem is indeed to use different host names, but defining the hosts is straightforward -- the additional host will be based_on the other one, with only job_template being different. That's essentially like having a separate template, just in the same hosts.yml file.
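
For reference, such a derived host could look like the following sketch in hosts.yml (the host and module names are examples, and only the relevant keys are shown):

  cluster:
      queue_type: pbs
      job_template: |
          ...
          module load R/3.3
          sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}
  cluster_r34:
      based_on: cluster
      job_template: |
          ...
          module load R/3.4
          sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}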

> to execute the workflow with a template that uses the command sos run to run the entire workflow

I'm not sure this is appealing, because it is only relevant when we run jobs on a single node via sos run, in which case submitting a regular cluster job would be good enough. If we go to multiple nodes, my understanding is that sos execute has to be used anyway. So I still do not see a big appeal for a separate host specification.


BoPeng commented Nov 17, 2019

> Currently my solution to the problem is indeed to use different host names, but defining the hosts is straightforward

I agree that your solution is better than an additional parameter, since

task: queue='cluster_r3.3'

is better than

task: queue='cluster', template='r3.3'

and the underlying work is about the same. I mean, the workload for doing

cluster:
   job_template: ...
   r3.3: ...

is about the same for doing

cluster_r3.3:
   based_on: cluster
   job_template: ...

and is perhaps clearer.

In addition, users can always define their own templates with the -c config.yml mechanism.

The sos run case is a different story because it essentially tries to submit the entire workflow to the cluster with a template. Right now we have

sos run -r host [args]

being translated to

ssh host sos run [args]

and I am proposing additional definitions for host so that it can do

ssh host qsub job.sh

This is exactly the same as our process and pbs task queues, so conceptually we are saying: we use -r host to execute the workflow on a remote host, which is by default a process host but can also be a pbs host, in which case the workflow is submitted to the cluster.


BoPeng commented Nov 17, 2019

So what we need is something like

cluster:
   job_template: |
      ...              # template for executing a task
cluster1:
   based_on: cluster
   job_template: |
      ...              # template for running the workflow via sos run

The problem here is that there is no way to specify nodes, RAM, etc. from the command line, so they have to be hard-coded in the job_template. This is, however, still easier than writing a customized shell script and submitting it manually from the cluster.

Edit: "no way" is an overstatement. Since

sos run workflow --par1 a --par2 b

is allowed, it is easy to make the parameter dictionary available to the template, so users can do

sos run workflow -q cluster1 --mem 10G --ncores 5 
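
A cluster1 template could then, hypothetically, pick these keys up directly (a sketch; the key names mem and ncores simply mirror the command line above):

cluster1:
   based_on: cluster
   job_template: |
      #!/bin/bash
      #PBS -l nodes=1:ppn={ncores}
      #PBS -l mem={mem}
      sos run ...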

BoPeng closed this as completed Nov 18, 2019
BoPeng reopened this Nov 18, 2019
BoPeng pushed a commit that referenced this issue Nov 18, 2019

BoPeng commented Nov 21, 2019

Merged.

BoPeng closed this as completed Nov 21, 2019

gaow commented Nov 21, 2019

Is this new development also related to #1321, or does it actually implement it? Otherwise, what's its major impact? It would be great if there were documentation to read about it.


BoPeng commented Nov 21, 2019

I am working on the documentation, and the PBS version. Basically, right now:

  1. For each host entry you can define workflow_template, with system-provided keys command, workflow_id, filename, and script, plus keys defined in the host configuration.
  2. From the command line, additional keys can be passed with

sos run script -r host KEY=VALUE KEY1=VALUE1

Then SoS will populate the template and execute it.

There are several problems, such as -c not working in command (it is stripped, since the config file is supposed to be local and is not guaranteed to exist on the new host).

The PBS version will submit the shell script through qsub etc., with cores, nodes, etc. specified from the command line.
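
For instance, a minimal sketch putting items 1 and 2 together (the host name, address, and the module key are made up for illustration):

head_node:
    address: cluster.example.com
    workflow_template: |
        module load {module}
        {command}

sos run script -r head_node module=R/3.4

Here {command} is one of the system-provided keys, and module is a user key supplied as KEY=VALUE.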


BoPeng commented Nov 21, 2019

job_template has been renamed to task_template, although job_template is still usable for backward compatibility.


gaow commented Nov 21, 2019

Okay -- so for the single-host case, if we want to run an entire workflow on a cluster system by logging in to the headnode and running from there, what would be the currently recommended method? For a small workflow, submitting it as one multi-node job (and not using task, right?) might be better. But for a large workflow, reserving lots of resources at once upfront might not be a good idea? How should users choose between these? The single-host setting should in practice be the most popular one. I think it warrants more discussion in the remote_execution documentation.


BoPeng commented Nov 21, 2019

> if we want to run an entire workflow on a cluster system by logging in to the headnode and running from there, what would be the currently recommended method?

Not working yet, but by design:

  1. Add a workflow_template to the host definition
  2. sos run -r host mem=xx cores=xx nodes=xx

> For a small workflow, submitting it as one multi-node job (and not using task, right?) might be better.

If there are numerous small tasks, it should be much more efficient to run them this way, without the task overhead.

> But for a large workflow, reserving lots of resources at once upfront might not be a good idea?

I agree. The task mechanism is actually better for "development" of a workflow, or for interactive data analysis, because you run large steps one by one.

> How should users choose between these?

I would:

  1. For many small substeps, use -r.
  2. For several big substeps, use -q.

BoPeng changed the title from "Option cmd for task?" to "workflow_template for remote workflow execution" Nov 21, 2019

gaow commented Nov 25, 2019

I'm trying out the current master of sos and sos-pbs. I changed job_template to task_template, but it seems to require workflow_template instead?

INFO: Running module normal: 
ERROR: Failed to load workflow engine pbs: A workflow_template is required for queue midway2
ERROR: 'Host' object has no attribute '_task_engine'

I'm confused about the difference between task_template and workflow_template. Also, as you can see above, there is an attribute error. Please let me know if I need to upload an MWE to reproduce.

gaow reopened this Nov 25, 2019

BoPeng commented Nov 25, 2019

> I'm confused about the difference between task_template and workflow_template.

For the same host, task_template is used by sos -q and workflow_template is used by sos -r host KEY=VALUE. task_template is currently unchanged (but I plan to introduce {command}). The workflow_template should look something like

#PBS -n {workflow_id}
#PBS {nodes}
module load R/{version}
{command}

where workflow_id, filename, command, etc. are provided by the system, and the rest should be specified from the command line with KEY=VALUE.
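
With that template, the invocation would hypothetically look like

sos run script -r host version=3.4 nodes="-l nodes=2"

where version and nodes are the user keys referenced in the template (the values here are made up).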


gaow commented Nov 25, 2019

I have this script test.sos

[1]
task: queue = 'midway2_head', walltime = '3m', mem = '2G', cores = 1
print(1)

and I run it with issue_6.yml,

[MW] sos run test.sos -c issue_6.yml 
INFO: Running 1: 
WARNING: job_template for host configuration is deprecated. Please use task_template instead.
ERROR: Failed to load workflow engine pbs: A workflow_template is required for queue midway2_head
ERROR: [1]: [f152cccfd778561c]: Failed to load workflow engine pbs: A workflow_template is required for queue midway2_head

So this no longer works. Here I don't use the -r method anyway, so why is it still asking for workflow_template? Is it now required, with no default?


BoPeng commented Nov 25, 2019

Kind of busy now, but did you upgrade sos-pbs?


gaow commented Nov 25, 2019

Yes, sos-pbs is on current master (it has not been released yet). Also, there is the ERROR: 'Host' object has no attribute '_task_engine' from my other example, which hopefully is obvious enough to fix when you get a chance -- take your time!


BoPeng commented Nov 25, 2019

> ERROR: Failed to load workflow engine pbs: A workflow_template is required for queue midway2
> ERROR: 'Host' object has no attribute '_task_engine'

The problem was that the code assumed the existence of both task_template and workflow_template for a PBS engine. I have changed the code to allow the definition of only one (or none).

I am not sure if the missing _task_engine is a consequence of the first error.


gaow commented Nov 25, 2019

Thanks -- it seems to work now (at least my old workflow now works again).

gaow closed this as completed Nov 25, 2019

BoPeng commented Nov 25, 2019

Released sos 0.20.10 since this looks like a regression bug; I am wondering, though, why all task-related tests passed.
