
iter_collections and ls-file-collection #343

Merged
merged 19 commits into datalad:main from lstcol
May 11, 2023
Conversation

mih
Member

@mih mih commented May 8, 2023

This PR makes add-archive-content (largely) obsolete. Closes #183

Summary

  • new module iter_collections to establish a location in the spirit of Consolidate "traverser" from gooey and metalad into a common implementation #323. The idea is to use dataclasses instead of result records, and enums instead of string labels, to report on collection content. Iterators also support hashing (as a means to generate identifiers).
    • imports iterdir() from datalad-gooey as an iterator over a directory as a collection (of directory content)
    • implements itertar() for tar file archives -- reporting aligned with iterdir()
  • ls-file-collection command that can drive these iterators.
    • factors any glue code necessary into a custom validator
    • converts collection iterator items into result records and yields them
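The dataclass-plus-enum reporting pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual datalad-next implementation; the field names mirror the IterdirItem/PathType shapes visible in the test-failure output quoted later in this thread, but `DirItem` and `iter_dir_sketch` are made-up names:

```python
import os
from dataclasses import dataclass
from enum import Enum
from pathlib import Path, PurePath
from typing import Iterator, Optional


class PathType(Enum):
    """Item types a collection iterator can report; string labels only
    appear at the serialization boundary (e.g. JSON result records)."""
    file = 'file'
    directory = 'directory'
    symlink = 'symlink'


@dataclass
class DirItem:
    """One reported collection item (hypothetical name)."""
    path: PurePath
    type: PathType
    symlink_target: Optional[PurePath] = None


def iter_dir_sketch(path: Path) -> Iterator[DirItem]:
    """Non-recursive directory iterator in the spirit of iterdir()."""
    for entry in path.iterdir():
        # check symlinks first: a symlink to a directory is both
        if entry.is_symlink():
            yield DirItem(entry, PathType.symlink, Path(os.readlink(entry)))
        elif entry.is_dir():
            yield DirItem(entry, PathType.directory)
        else:
            yield DirItem(entry, PathType.file)
```

A driving command can then turn such typed items into result records at the very end, instead of passing string-labeled dicts around internally.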

This is somewhat in conflict with #342 due to suboptimal communication. A future discussion needs to sort out how the two approaches can be consolidated.

TODO:

  • we need to be able to hash collection items, ideally using a configurable algorithm. The command draft prepares for that. However, it is still unclear whether this would be done in a collection iterator (pretty strong requirement), or within the command itself (complicated to achieve, because it would require duplicating any possibly necessary data transport logic between the iterator implementation and the command). That being said, one could envision remote collections that can readily provide a hash -- for those it would make most sense to let the iterator report them. So maybe a hash parameter of ls-file-collection should be an instruction on what to compute, if there is no hash already. If so, an iterator would not only need to report a hash, but also the type of the hash it is reporting.
    It could be a sensible middle-ground to have an iterator provide a zero-argument callable that provides a file-like that generic hash-computation can use, if the controlling command deems it necessary (e.g. a requested hash is not readily provided).
    That being said, it is unlikely that a generic read-from-some-file-like implementation is going to be optimal. The bandwidth of possibilities and necessities for download (i.e. anything remote: streaming, chunk size, etc.) alone cries for configurability. So maybe going down the path of UrlOperations, which can hash on the go, is a better approach. Maybe factor that out into a dedicated class, and simply make it easy to use...
  • for Implement alternative to add-archive-content #183 we need an iterator for archives. We could have two: a simple one for local archives, and a more sophisticated one for remote archives (see Draft for Issue 183 list collection #342) -- but they would not necessarily differ much in the end, if hashing must be performed, rather than being able to report an existing hash (see https://doi.org/10.1038/s41597-022-01163-2).
  • Draft for Issue 183 list collection #342 mentions auto-detection of collection types. I initially had the same idea, but then could not come up with a use case where one would want to (surrender to) auto-detect. What auto would do could/would change over time. Conflicts would always be there (even if I had a git repo worktree, I might still just need a directory listing). auto could maybe provide a bit of perceived convenience for interactive use, but that convenience would need to be bought with the added complexity of auto-detection and of preserving its outcome, long-term. Not worth it IMHO.
  • legit error on windows (symlink target looks broken)
       FAILED ..\iter_collections\tests\test_iterdir.py::test_iterdir - AssertionError: assert IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/symlink'), type=<PathType.symlink: 'symlink'>, symlink_target=WindowsPath('//?/C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir/file_in_dir.txt')) in [IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/random_file1.txt'), type=<PathType.file: 'file'>, symlink_target=None), IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir'), type=<PathType.directory: 'directory'>, symlink_target=None), IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/symlink'), type=<PathType.symlink: 'symlink'>, symlink_target=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir/file_in_dir.txt'))]

This seems to be related to the following doc snippet

Changed in version 3.8: Added support for directory junctions, and changed to return the substitution path (which typically includes the \\?\ prefix) rather than the optional “print name” field that was previously returned.

https://bugs.python.org/issue42957

  • report file sizes for all collection types
  • Stop using StrEnum -- only available from Python 3.11 onwards
  • Bring test coverage of iter_collections to 100%
  • Add tests for ls-file-collection
  • Fully document iter_collections
  • Document ls-file-collection. Include a demo on replacing add-archive-content
  • Make decision on representation of hardlinks (see TODO in test_replace_add_archive_content)
  • Something not correct with path type from itertar on windows yet
  • Add custom result renderer
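The "sensible middle-ground" from the hashing TODO above -- an iterator attaches a zero-argument callable providing a file-like, and generic hash computation runs only when the controlling command needs it -- could look roughly like this. A sketch only; `compute_hash` and `open_item` are hypothetical names, not part of the PR:

```python
import hashlib
from typing import BinaryIO, Callable


def compute_hash(open_item: Callable[[], BinaryIO],
                 algorithm: str = 'md5',
                 chunk_size: int = 65536) -> str:
    """Generic hash computation over a file-like provided on demand.

    `open_item` is the zero-argument callable an iterator could attach to
    each reported item; the command calls it only when a requested hash is
    not already reported by the collection itself.
    """
    hasher = hashlib.new(algorithm)
    with open_item() as f:
        # stream in chunks, so large (possibly remote) items need not
        # be held in memory at once
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

As the TODO notes, a generic reader like this cannot be optimal for every transport, which is the argument for pushing hashing into something like UrlOperations instead.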


@jsheunis
Member

jsheunis commented May 8, 2023

I lack the experience and understanding wrt this functionality to comment specifically on the thought process or implementation, but just noting for the record that I ran a test locally and all seems to work nicely:

# create dataset with multi-level files

$ datalad create lscollection_test
$ mkdir lscollection_test/subdir
$ echo test1 > lscollection_test/topfile.txt
$ echo test2 > lscollection_test/subdir/subfile.txt
$ datalad save

# run command on root dir

$ datalad -f json ls-file-collection directory lscollection_test

{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/subdir", "status": "ok", "type": "directory"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/topfile.txt", "status": "ok", "symlink_target": ".git/annex/objects/1Z/Z0/MD5E-s6--3e7705498e8be60520841409ebc69bc1.txt/MD5E-s6--3e7705498e8be60520841409ebc69bc1.txt", "type": "symlink"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.gitattributes", "status": "ok", "type": "file"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.git", "status": "ok", "type": "directory"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.datalad", "status": "ok", "type": "directory"}

# run command on sub dir

$ datalad -f json ls-file-collection directory lscollection_test/subdir

{"action": "ls_file_collection", "collection": "lscollection_test/subdir", "item": "lscollection_test/subdir/subfile.txt", "status": "ok", "symlink_target": "../.git/annex/objects/fx/MG/MD5E-s6--126a8a51b9d1bbd07fddc65819a542c3.txt/MD5E-s6--126a8a51b9d1bbd07fddc65819a542c3.txt", "type": "symlink"}

mih added 3 commits May 9, 2023 06:40
This is a start for consolidating common functionality scattered
around various extensions into a single implementation (pattern).

This changeset imports `iterdir()` from `datalad-gooey`. In contrast to
the original implementation, this new one uses a stricter approach
to types, and overfits less to a dataset-aware use case.

However, it is not meant to be the exclusive implementation, but merely
a start and a place to migrate directory iterators into.

Ping datalad#323
mih added 2 commits May 9, 2023 09:26
This is trying to be structurally similar to the `directory` collection
implementation.
@mih
Member Author

mih commented May 9, 2023

Brief look at performance. Comparing the envisioned CLI use for a TAR file to get all the information needed to satisfy datalad addurls

datalad -f json ls-file-collection --hash md5 tarfile <TARFILE>

with the conventional near-equivalent (not an exact match, but in practice it would be possible to quickly grab the same info):

tar --to-command=md5sum -xf <TARFILE>

Test case is a 2.5GB tgz with 1470 files inside.

datalad ls-file-collection:

            Mean        Std.Dev.    Min         Median      Max
real        6.981       0.384       6.688       6.732       7.523       
user        6.366       0.445       5.935       6.187       6.978       
sys         0.584       0.133       0.452       0.535       0.765       

tar + md5sum:

            Mean        Std.Dev.    Min         Median      Max
real        14.247      0.730       13.535      13.954      15.251      
user        17.563      0.931       16.752      17.072      18.867      
sys         2.766       0.246       2.556       2.630       3.111

Promising!
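For context, hashing tar members without extracting them to disk -- the core of what the ls-file-collection --hash tarfile invocation above does conceptually -- is possible with the standard library alone. A sketch, not the actual itertar implementation:

```python
import hashlib
import tarfile


def iter_tar_md5(tar_path):
    """Yield (member_name, md5_hexdigest) for each regular file in a
    tarball, streaming member content instead of extracting to disk."""
    with tarfile.open(tar_path, 'r:*') as tf:
        for member in tf:
            if not member.isfile():
                continue
            hasher = hashlib.md5()
            # extractfile() returns a file-like over the member's bytes
            fobj = tf.extractfile(member)
            for chunk in iter(lambda: fobj.read(65536), b''):
                hasher.update(chunk)
            yield member.name, hasher.hexdigest()
```

Avoiding the on-disk extraction that `tar -x` performs is plausibly where part of the 2x speed-up observed above comes from.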

@mih
Member Author

mih commented May 9, 2023

Tests are good now; the remaining failure is httpbin accessibility.

Using a pre-crafted (300 byte) tarball that is put on github, and is
downloaded once per session.
@codecov

codecov bot commented May 10, 2023

Codecov Report

Patch coverage: 99.21% and project coverage change: +0.23 🎉

Comparison is base (9c65b96) 91.98% compared to head (4adba2c) 92.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   91.98%   92.22%   +0.23%     
==========================================
  Files          86       93       +7     
  Lines        7478     7730     +252     
==========================================
+ Hits         6879     7129     +250     
- Misses        599      601       +2     
Impacted Files Coverage Δ
datalad_next/__init__.py 100.00% <ø> (ø)
datalad_next/constraints/parameter.py 100.00% <ø> (ø)
datalad_next/iter_collections/utils.py 96.66% <96.66%> (ø)
datalad_next/commands/ls_file_collection.py 98.48% <98.48%> (ø)
datalad_next/commands/__init__.py 88.23% <100.00%> (ø)
...lad_next/commands/tests/test_ls_file_collection.py 100.00% <100.00%> (ø)
datalad_next/conftest.py 100.00% <100.00%> (ø)
...d_next/constraints/tests/test_cmdarg_validation.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/directory.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/tarfile.py 100.00% <100.00%> (ø)
... and 2 more


mih added 4 commits May 10, 2023 17:59
Also include a check for these requirements that is executed in
the error case (not performance critical), to inform developers
about obvious implementation issues.

Closes datalad#348
This is (also) an alternative approach to `add-archive-content`.

In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
@mih
Member Author

mih commented May 11, 2023

Few more bits on performance. This time on the directory collection type. On a single directory with 36k files (I think that is a reasonable number for a directory with "a lot of files"):

In [8]: timeit r = dl.ls_file_collection('directory', '/home/mih/dcm/study_20140724_095117357
   ...: ', result_renderer='disabled')
1.36 s ± 7.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: timeit !ls -l ~/dcm/study_20140724_095117357 > /dev/null
439 ms ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [21]: timeit l = list(iterdir(Path('/home/mih/dcm/study_20140724_095117357')))
437 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With so many files, offset costs for argument validation etc. no longer matter much. Speed is about 1/3 of a plain ls for going through the high-level command API. This is not so bad for a first attempt IMHO.

Importantly, the actual iterator is on the same performance-level as ls itself.

@jsheunis
Member

jsheunis commented May 11, 2023

(this comment disappeared from github when I posted it, so sorry if it's duplicated)

I just realised that ls-file-collection can be a great lean replacement for meta-extract or meta-conduct when it comes to supplying metadata of the whole file tree of some file hierarchy to datalad-catalog. Obviously it provides somewhat different information, but some scripting can fill the gaps between what datalad-catalog needs and what ls-file-collection does not (currently) provide. And the bump in performance and the ability to provide information on non-datalad datasets (or non-git repos) are major gains!

Some current differences (for the context of catalog file metadata generation; and based on my limited experience with ls-file-collection):

|  | meta-extract | ls-file-collection |
| --- | --- | --- |
| dataset traversing | recurses with meta-conduct | does not recurse; will need scripting or some integration with tree |
| content availability | provides availability data of file content if available in annex | provides location of symlink in annex, but not availability data |
| content size | gives size of annexed contents even if not local | gives size of symlink, or of file if local |
| runs on | datalad dataset | directory, tarfile, (URL) |
| id and version info | provides id and version of parent dataset | does not provide id and version of parent dataset (no concept of parent) |

@mslw
Collaborator

mslw commented May 11, 2023

I had similar thoughts @jsheunis (thanks for the great summary!) - and it seems that "iterworktree" and "itergittree" which were already proposed in separate issues could help with that too. For non-git directories, I imagine that we could also add a "directory-recursive" collection that would be based on Path.rglob() instead of Path.iterdir() used by "directory" (nb. Path.walk() is coming in Python 3.12).
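A minimal sketch of the hypothetical "directory-recursive" collection mentioned above, based on Path.rglob(). The name, the function, and the choice to report root-relative paths are assumptions, not an existing collection type:

```python
from pathlib import Path
from typing import Iterator


def iter_dir_recursive(root: Path) -> Iterator[Path]:
    """Like the 'directory' collection, but descend into subdirectories
    via Path.rglob(); report paths relative to the collection root."""
    for p in sorted(root.rglob('*')):
        yield p.relative_to(root)
```

On Python 3.12+ the same could be built on Path.walk(), which also exposes directory/file distinction per level.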

On the note of recursion -- seems pretty obvious, but while ls-file-collection on a nested directory is non-recursive (and the documentation says so), ls-file-collection on a tar file made from that directory will list all items, because of the tar file's nature (so everything is perfectly in order and in scope).

Overall, on the few small examples that I tried the command behaved predictably, and the code looks very modular which will be nice for any potential additions (and to be clear, mentions of "other collections" above were just general comments that need not & should not be added in this pr).

@mih
Member Author

mih commented May 11, 2023

Hey @jsheunis @mslw just briefly confirming that this kind of reuse is indeed not only in scope, but more or less the focus. However, I want to point out that ls-file-collection is primarily a convenience "driver" for the underlying iterator. As can be seen from the benchmark above, there is a substantial speed difference between the command API and the iterator. This is something that continued to bug me about how I wrote status and diff. So I would suspect that within-datalad reuse will focus on the iterator, not ls-file-collection (which is still useful for CLI scripting).

That being said, the iterators can be made more customizable, once it is clear how that would make sense. However, I would prefer and aim for multiple, clear, and focused implementations, rather than sophisticated Swiss-army knives.

Next up would be git-focused iterators.

They should make it possible to drop a substantial amount of custom code from gooey.

I did start thinking about annex-focused iterators, but I have not concluded on a concept yet (will file an issue once I know something). My current thinking is exploring the possibility of wrapping annex find.

@mih
Member Author

mih commented May 11, 2023

I will merge this now to enable further parallel incremental improvements.

@mih mih merged commit 93cb31f into datalad:main May 11, 2023
@mih mih deleted the lstcol branch May 11, 2023 17:09
@jsheunis
Member

Thanks for the explanation @mih, it gives me better context.

Successfully merging this pull request may close these issues.

str(CommandParametrizationError(...)) with with KeyError
Implement alternative to add-archive-content

3 participants