Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing existing input or dependent file should be an option rather than workflow feature #1197

Closed
BoPeng opened this issue Jan 29, 2019 · 15 comments
Assignees

Comments

@BoPeng
Copy link
Contributor

BoPeng commented Jan 29, 2019

I am not quite comfortable at the fact that a user would have to change the workflow from using input to depends, or vise versa if he or she would like to turn on/off tracing dependencies of existing dependencies. It would make sense to introduce an option to

  1. If the option is turned on, SoS will try to trace the dependencies of all existing targets, including the ones in input and in depends statements.
  2. If the option is turned off, SoS operates in the usual Makefile style, namely not tracing dependencies.

The option could be named -D (dependency) or -T (tracing), although it is a bit difficult to explain this concept very quickly to users.

Note that the existing behavior is more flexible in that it allows part of the workflow to be dependency tracing and part of it non-dependency tracing.

@gaow
Copy link
Member

gaow commented Jan 29, 2019

Note that the existing behavior is more flexible in that it allows part of the workflow to be dependency tracing and part of it non-dependency tracing.

This is exactly what I do with it. I think it is good behavior. Also, one can always do:

depends: '1.txt'
input: '1.txt'

to make it clear that we expect to track the source of 1.txt, right?

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 29, 2019

On the other hand, if a user have a makefile style workflow and would like to run it with dependency tracing, it is quite troublesome and sometime not feasible since at least depends does not support group_by etc.

@gaow
Copy link
Member

gaow commented Jan 29, 2019

Okay then how about make it an input option? One would still have to make it clear for each step, but this allows more flexibility.

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

The design goal of SoS is to make things easy and intuitive so at this point I am trying to cut down options if the added flexibility/power does not worth the added complexity (e.g. the active parameter with the presence of *_if actions).

My overall feeling is that this feature determines how to "run" a workflow. It is not something that a user would think of when they are writing it. I mean, it is not likely users would consider what to do if the input does not exist when they write a workflow, and the current design requires users to rewrite the workflow when they need dependency tracing (change input to depends and variables _input to _depends, and latter is generally not recommended to use directly).

So to make the change less painful, we could allow

input: 'a.txt', trace_input=True

or something like

input: produced('a.txt')

but it is still painful if the workflow is large. In comparison,

sos run -T

with an effect similar to -s force sounds more natural to me.

The other problem is with the -t target option, which did not have dependency tracing and now has dependency tracing, and I am not quite sure which way is better because users might get used to makefile style (no dependency tracing) and be irritated if the entire workflow runs repeatedly with -t. So something like -t target is default no tracing (makefile) style, and -Tt target as the more advanced dependency tracing style that saves the user from writing make clean-type of rules can be useful.

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

Also, I am not against the idea of explicit dependency tracing using input option etc, especially when we have named_output which not only imports the output, but also creates the dependency. A natural extension can be something like unnamed_output which matches filename but also creates dependency.

@gaow
Copy link
Member

gaow commented Jan 30, 2019

My overall feeling is that this feature determines how to "run" a workflow. It is not something that a user would think of when they are writing it.

I'm not sure if it is true. I think when users write a workflow, it is not unreasonable to assume they know whether or not some files are provided externally or from other steps. For example:

[global]
file1 = '1.txt'
file2 = '2.txt'

[some_step]
output: file2

[another_step]
input: file1, file2

In this case, I guess

[another_step]
input: file1, unnamed_output(file2)

makes it more explicit compared to using -T and let SoS do the work.

On the other hand, I do appreciate the elegance of -T. If I were to choose between one solution, I'd perhaps go for -T, because it is simpler in a good way, despite its limitations.

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

Yes, the users know which files are external and which files are produced by others but they do not have to differentiate them in the workflow using unnamed_output(), and it can be a burden to maintain a script, for example if one decides to add a step to download an external file, in which case he/she will need to change a file from external to output of the download step.

But the biggest advantage of -T is to handle inputs because I really do not want to add group_by to depends: so that it can be used to replace input:.

@gaow
Copy link
Member

gaow commented Jan 30, 2019

So we agree to keep all current behavior as is, and add -T, and possibly revisit it in the future if there is a need for more complicated behavior?

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

I want to revert the depends magic.

@gaow
Copy link
Member

gaow commented Jan 30, 2019

The objection I have is that for workflows with large number of inputs and outputs -T would have a big performance loss if we are in fact only trying to keep track of a couple of very specific files ... But I agree depends is a partial and hacky solution to it; and if that is really a problem we'll possibly use input options or unnamed_output anyways. Let's revert it and see how soon in the future you or I (or others) complain about performance.

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

OK, my belief is that performance can be improved over time but complex syntax will only get worse.

BoPeng pushed a commit that referenced this issue Jan 30, 2019
BoPeng pushed a commit that referenced this issue Jan 30, 2019
BoPeng pushed a commit that referenced this issue Jan 30, 2019
@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 30, 2019

unnamed_output is very long and not very specified so I introduced traced, which means that dependency of the target will be traced with/without -T.

That is to say, step A in the following workflow will be executed regardless of -T,

[A]
output: 'a.txt'
_output.touch()

[default]
input: traced('a.txt')
print('ok')

and

[A]
output: 'a.txt'
_output.touch()

[default]
input: 'a.txt'
print('ok')

will behave differently with/without -T.

With #1199 , I think the performance impact of -T should be more or less acceptable and traced should help the creation of static makefile style workflow (???).

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 31, 2019

Dependency creation document is updated again , with the new -T option. This option is pretty advanced but can surprisingly simplify the examples because I do not have to remove intermediate files between examples.

@BoPeng
Copy link
Contributor Author

BoPeng commented Jan 31, 2019

@gaow I have created a tutorial on the use of -T option and traced() function, I will consider this issue resolved if you do not have any other suggestion.

@gaow
Copy link
Member

gaow commented Jan 31, 2019

It reads great -- it's nice that you ended up implementing both approaches! Yes this issue is resolved.

@gaow gaow closed this as completed Jan 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants