The nxf_unstage function unnecessarily copies input files that match output glob pattern. #3995

Open
robsyme opened this issue Jun 2, 2023 · 2 comments

Collaborator

robsyme commented Jun 2, 2023

Bug report

Many users set the scratch true directive, in part to minimize the size of the shared work directory: the files saved there should be restricted to those needed for downstream tasks and for the resume mechanism.

When a process's output glob pattern also matches an input file, that input file is unnecessarily copied back into the shared work directory.

Steps to reproduce the problem

Given main.nf:

process GreedyOutputGlob {
    scratch true
    input: path(csv)
    output: path("*.csv")
    script: "cp $csv out.csv"
}

workflow {
    Channel.fromPath("data/in.csv")
    | GreedyOutputGlob
    | view
}

Note that the in.csv file is copied back to the shared work directory:

❯ nextflow run .      
N E X T F L O W  ~  version 23.04.1
Launching `./main.nf` [hopeful_church] DSL2 - revision: 06d2458686
executor >  local (1)
[42/2fa08b] process > GreedyOutputGlob (1) [100%] 1 of 1 ✔
/private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed/out.csv


/tmp/foo on ☁️  sts on ☁️  devstar2002@gcplab.me took 2s 
❯ tree work 
work
└── 42
    └── 2fa08b2ef83cd1799c58833592deed
        ├── in.csv
        └── out.csv

3 directories, 2 files

This is because the nxf_unstage command uses the output glob pattern directly, without regard to the input files:

# ...
for name in $(eval "ls -1d *.csv" | sort | uniq); do
    nxf_fs_copy "$name" /private/tmp/foo/work/42/2fa08b2ef83cd1799c58833592deed || true
done
# ...

Expected behaviour and actual behaviour

To spare users from storing duplicated input files, it would be better if Nextflow excluded input files from being copied back to the shared work directory (unless the includeInputs: true option is set on the output in the output: block).
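As a rough illustration of the requested behaviour, the unstage loop could skip any glob match whose name is also a staged input. This is only a sketch under assumptions: `list_outputs_excluding_inputs` is a hypothetical helper, not an existing Nextflow function, and it assumes the names of the staged inputs are known and can be passed in as arguments. Like the generated loop above, it does not handle file names containing whitespace.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Nextflow): print every path matching the
# output glob, except those whose name equals one of the staged input names.
# Usage: list_outputs_excluding_inputs GLOB [INPUT_NAME...]
list_outputs_excluding_inputs() {
    local pattern=$1
    shift
    local name input skip
    # Same glob expansion as the generated nxf_unstage loop.
    for name in $(eval "ls -1d $pattern" 2>/dev/null | sort | uniq); do
        skip=0
        for input in "$@"; do
            if [ "$name" = "$input" ]; then
                skip=1
                break
            fi
        done
        if [ "$skip" -eq 0 ]; then
            echo "$name"
        fi
    done
    return 0
}
```

With this in place, the loop would iterate over `$(list_outputs_excluding_inputs "*.csv" in.csv)` instead of the raw `ls` output, so only out.csv would reach nxf_fs_copy. An includeInputs: true output would simply call the helper with no input names, which leaves the current behaviour unchanged.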

Environment

  • Nextflow version: 23.04.1
  • Java version: openjdk version "17.0.5" 2022-10-18
  • Operating system: all
  • Bash version: all

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 15, 2023
@mahesh-panchal
Contributor

If this is reworked, I'd like to link a related issue: hidden files should also be unstaged correctly (#2983).

@stale stale bot removed the stale label Jun 4, 2024