
iter_collections and ls-file-collection #343

Merged
merged 19 commits into datalad:main from lstcol
May 11, 2023
Conversation

mih
Member

@mih mih commented May 8, 2023

This PR makes add-archive-content (largely) obsolete. Closes #183

Summary

  • new module iter_collections to establish a location in the spirit of Consolidate "traverser" from gooey and metalad into a common implementation #323. The idea is to use dataclasses instead of result records, and enums instead of string labels, to report on collection content. Iterators also support hashing (as a means to generate identifiers).
    • imports iterdir() from datalad-gooey as an iterator over a directory as a collection (of directory content)
    • implements itertar() for tar file archives -- reporting aligned with iterdir()
  • ls-file-collection command that can drive these iterators.
    • factors any glue code necessary into a custom validator
    • converts collection iterator items into result records and yields them
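The dataclass-plus-enum reporting pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual datalad-next implementation; the field names mirror the IterdirItem/PathType shapes visible in the test-failure output quoted later in this thread, but `DirItem` and `iter_dir_sketch` are made-up names:

```python
import os
from dataclasses import dataclass
from enum import Enum
from pathlib import Path, PurePath
from typing import Iterator, Optional


class PathType(Enum):
    """Item types a collection iterator can report; string labels only
    appear at the serialization boundary (e.g. JSON result records)."""
    file = 'file'
    directory = 'directory'
    symlink = 'symlink'


@dataclass
class DirItem:
    """One reported collection item (hypothetical name)."""
    path: PurePath
    type: PathType
    symlink_target: Optional[PurePath] = None


def iter_dir_sketch(path: Path) -> Iterator[DirItem]:
    """Non-recursive directory iterator in the spirit of iterdir()."""
    for entry in path.iterdir():
        # check symlinks first: a symlink to a directory is both
        if entry.is_symlink():
            yield DirItem(entry, PathType.symlink, Path(os.readlink(entry)))
        elif entry.is_dir():
            yield DirItem(entry, PathType.directory)
        else:
            yield DirItem(entry, PathType.file)
```

A driving command can then turn such typed items into result records at the very end, instead of passing string-labeled dicts around internally.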

This is somewhat in conflict with #342 due to suboptimal communication. A future discussion needs to sort out how the two approaches can be consolidated.

TODO:

  • we need to be able to hash collection items, ideally using a configurable algorithm. The command draft prepares for that. However, it is still unclear whether this would be done in a collection iterator (pretty strong requirement), or within the command itself (complicated to achieve, because it would require duplicating any possibly necessary data transport logic between the iterator implementation and the command). That being said, one could envision remote collections that can readily provide a hash -- for those it would make most sense to let the iterator report them. So maybe a hash parameter of ls-file-collection should be an instruction on what to compute, if there is no hash already. If so, an iterator would not only need to report a hash, but also the type of the hash it is reporting.
    It could be a sensible middle-ground to have an iterator provide a zero-argument callable that provides a file-like that generic hash-computation can use, if the controlling command deems it necessary (e.g. a requested hash is not readily provided).
    That being said, it is unlikely that a generic read-from-some-file-like implementation is going to be optimal. The bandwidth of possibilities and necessities for download (i.e. anything remote: streaming, chunk size, etc.) alone cries for configurability. So maybe going down the path of UrlOperations, which can hash on the go, is a better approach. Maybe factor that out into a dedicated class, and simply make it easy to use...
  • for Implement alternative to add-archive-content #183 we need an iterator for archives. We could have two: a simple one for local archives, and a more sophisticated one for remote archives (see Draft for Issue 183 list collection #342) -- but they would not necessarily differ much in the end, if hashing must be performed, rather than being able to report an existing hash (see https://doi.org/10.1038/s41597-022-01163-2).
  • Draft for Issue 183 list collection #342 mentions auto-detection of collection types. I initially had the same idea, but then could not come up with a use case where one would want to (surrender to) auto-detect. What auto would do could/would change over time. Conflicts would always be there (even if I had a git repo worktree, I might still just need a directory listing). auto could maybe provide a bit of perceived convenience for interactive use, but that convenience would need to be bought with the added complexity of auto-detection and of preserving its outcome, long-term. Not worth it IMHO.
  • legit error on windows (symlink target looks broken)
       FAILED ..\iter_collections\tests\test_iterdir.py::test_iterdir - AssertionError: assert IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/symlink'), type=<PathType.symlink: 'symlink'>, symlink_target=WindowsPath('//?/C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir/file_in_dir.txt')) in [IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/random_file1.txt'), type=<PathType.file: 'file'>, symlink_target=None), IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir'), type=<PathType.directory: 'directory'>, symlink_target=None), IterdirItem(path=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/symlink'), type=<PathType.symlink: 'symlink'>, symlink_target=WindowsPath('C:/DLTMP/pytest-of-appveyor/pytest-0/dir_tree0/some_dir/file_in_dir.txt'))]

This seems to be related to the following doc snippet

Changed in version 3.8: Added support for directory junctions, and changed to return the substitution path (which typically includes the \\?\ prefix) rather than the optional “print name” field that was previously returned.

https://bugs.python.org/issue42957

  • report file sizes for all collection types
  • Stop using StrEnum -- only available from Python 3.11 onwards
  • Bring test coverage of iter_collections to 100%
  • Add tests for ls-file-collection
  • Fully document iter_collections
  • Document ls-file-collection. Include a demo on replacing add-archive-content
  • Make decision on representation of hardlinks (see TODO in test_replace_add_archive_content)
  • Something not correct with path type from itertar on windows yet
  • Add custom result renderer
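The "sensible middle-ground" from the hashing TODO above -- an iterator attaches a zero-argument callable providing a file-like, and generic hash computation runs only when the controlling command needs it -- could look roughly like this. A sketch only; `compute_hash` and `open_item` are hypothetical names, not part of the PR:

```python
import hashlib
from typing import BinaryIO, Callable


def compute_hash(open_item: Callable[[], BinaryIO],
                 algorithm: str = 'md5',
                 chunk_size: int = 65536) -> str:
    """Generic hash computation over a file-like provided on demand.

    `open_item` is the zero-argument callable an iterator could attach to
    each reported item; the command calls it only when a requested hash is
    not already reported by the collection itself.
    """
    hasher = hashlib.new(algorithm)
    with open_item() as f:
        # stream in chunks, so large (possibly remote) items need not
        # be held in memory at once
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

As the TODO notes, a generic reader like this cannot be optimal for every transport, which is the argument for pushing hashing into something like UrlOperations instead.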


@jsheunis
Member

jsheunis commented May 8, 2023

I lack the experience and understanding wrt this functionality to comment specifically on the thought process or implementation, but just noting for the record that I ran a test locally and all seems to work nicely:

# create dataset with multi-level files

$ datalad create lscollection_test
$ mkdir lscollection_test/subdir
$ echo test1 > lscollection_test/topfile.txt
$ echo test2 > lscollection_test/subdir/subfile.txt
$ datalad save

# run command on root dir

$ datalad -f json ls-file-collection directory lscollection_test

{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/subdir", "status": "ok", "type": "directory"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/topfile.txt", "status": "ok", "symlink_target": ".git/annex/objects/1Z/Z0/MD5E-s6--3e7705498e8be60520841409ebc69bc1.txt/MD5E-s6--3e7705498e8be60520841409ebc69bc1.txt", "type": "symlink"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.gitattributes", "status": "ok", "type": "file"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.git", "status": "ok", "type": "directory"}
{"action": "ls_file_collection", "collection": "lscollection_test", "item": "lscollection_test/.datalad", "status": "ok", "type": "directory"}

# run command on sub dir

$ datalad -f json ls-file-collection directory lscollection_test/subdir

{"action": "ls_file_collection", "collection": "lscollection_test/subdir", "item": "lscollection_test/subdir/subfile.txt", "status": "ok", "symlink_target": "../.git/annex/objects/fx/MG/MD5E-s6--126a8a51b9d1bbd07fddc65819a542c3.txt/MD5E-s6--126a8a51b9d1bbd07fddc65819a542c3.txt", "type": "symlink"}

mih added 3 commits May 9, 2023 06:40
This is a start for consolidating common functionality scattered
around various extensions into a single implementation (pattern).

This changeset imports `iterdir()` from `datalad-gooey`. In contrast to
the original implementation, this new one uses a stricter approach
to types, and overfits less to a dataset-aware use case.

However, it is not meant to be the exclusive implementation, but merely
a start and a place to migrate directory iterators into.

Ping datalad#323
mih added 2 commits May 9, 2023 09:26
This is trying to be structurally similar to the `directory` collection
implementation.
@mih
Member Author

mih commented May 9, 2023

Brief look at performance. Comparing the envisioned CLI use for a TAR file to get all the information needed to satisfy datalad addurls

datalad -f json ls-file-collection --hash md5 tarfile <TARFILE>

with the conventional near-equivalent (not an exact match, but in practice it would be possible to quickly grab the same info):

tar --to-command=md5sum -xf <TARFILE>

Test case is a 2.5GB tgz with 1470 files inside.

datalad ls-file-collection:

            Mean        Std.Dev.    Min         Median      Max
real        6.981       0.384       6.688       6.732       7.523       
user        6.366       0.445       5.935       6.187       6.978       
sys         0.584       0.133       0.452       0.535       0.765       

tar + md5sum:

            Mean        Std.Dev.    Min         Median      Max
real        14.247      0.730       13.535      13.954      15.251      
user        17.563      0.931       16.752      17.072      18.867      
sys         2.766       0.246       2.556       2.630       3.111

Promising!
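For context, hashing tar members without extracting them to disk -- the core of what the ls-file-collection --hash tarfile invocation above does conceptually -- is possible with the standard library alone. A sketch, not the actual itertar implementation:

```python
import hashlib
import tarfile


def iter_tar_md5(tar_path):
    """Yield (member_name, md5_hexdigest) for each regular file in a
    tarball, streaming member content instead of extracting to disk."""
    with tarfile.open(tar_path, 'r:*') as tf:
        for member in tf:
            if not member.isfile():
                continue
            hasher = hashlib.md5()
            # extractfile() returns a file-like over the member's bytes
            fobj = tf.extractfile(member)
            for chunk in iter(lambda: fobj.read(65536), b''):
                hasher.update(chunk)
            yield member.name, hasher.hexdigest()
```

Avoiding the on-disk extraction that `tar -x` performs is plausibly where part of the 2x speed-up observed above comes from.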

@mih
Member Author

mih commented May 9, 2023

Tests are good now; the remaining failure is httpbin accessibility.

Using a pre-crafted (300 byte) tarball that is put on github, and is
downloaded once per session.
@codecov

codecov bot commented May 10, 2023

Codecov Report

Patch coverage: 99.21% and project coverage change: +0.23 🎉

Comparison is base (9c65b96) 91.98% compared to head (4adba2c) 92.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   91.98%   92.22%   +0.23%     
==========================================
  Files          86       93       +7     
  Lines        7478     7730     +252     
==========================================
+ Hits         6879     7129     +250     
- Misses        599      601       +2     
Impacted Files Coverage Δ
datalad_next/__init__.py 100.00% <ø> (ø)
datalad_next/constraints/parameter.py 100.00% <ø> (ø)
datalad_next/iter_collections/utils.py 96.66% <96.66%> (ø)
datalad_next/commands/ls_file_collection.py 98.48% <98.48%> (ø)
datalad_next/commands/__init__.py 88.23% <100.00%> (ø)
...lad_next/commands/tests/test_ls_file_collection.py 100.00% <100.00%> (ø)
datalad_next/conftest.py 100.00% <100.00%> (ø)
...d_next/constraints/tests/test_cmdarg_validation.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/directory.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/tarfile.py 100.00% <100.00%> (ø)
... and 2 more


mih added 4 commits May 10, 2023 17:59
Also include a check for these requirements that is executed in
the error case (not performance critical), to inform developers
about obvious implementation issues.

Closes datalad#348
This is (also) an alternative approach to `add-archive-content`.

In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
@mih
Member Author

mih commented May 11, 2023

Few more bits on performance. This time on the directory collection type. On a single directory with 36k files (I think that is a reasonable number for a directory with "a lot of files"):

In [8]: timeit r = dl.ls_file_collection('directory', '/home/mih/dcm/study_20140724_095117357
   ...: ', result_renderer='disabled')
1.36 s ± 7.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: timeit !ls -l ~/dcm/study_20140724_095117357 > /dev/null
439 ms ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [21]: timeit l = list(iterdir(Path('/home/mih/dcm/study_20140724_095117357')))
437 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With so many files, offset costs for argument validation etc. no longer matter much. Speed is about 1/3 of a plain ls for going through the high-level command API. This is not so bad for a first attempt IMHO.

Importantly, the actual iterator is on the same performance-level as ls itself.

@jsheunis
Member

jsheunis commented May 11, 2023

(this comment disappeared from github when I posted it, so sorry if it's duplicated)

I just realised that ls-file-collection can be a great lean replacement for meta-extract or meta-conduct when it comes to supplying metadata of the whole file tree of some file hierarchy to datalad-catalog. Obviously it provides somewhat different information, but some scripting can fill the gaps between what datalad-catalog needs and what ls-file-collection does not (currently) provide. And the bump in performance and the ability to provide information on non-datalad datasets (or non-git repos) are major gains!

Some current differences (for the context of catalog file metadata generation; and based on my limited experience with ls-file-collection):

|  | meta-extract | ls-file-collection |
| --- | --- | --- |
| dataset traversing | recurses with meta-conduct | does not recurse; will need scripting or some integration with tree |
| content availability | provides availability data of file content if available in annex | provides location of symlink in annex, but not availability data |
| content size | gives size of annexed contents even if not local | gives size of symlink, or of file if local |
| runs on | datalad dataset | directory, tarfile, (URL) |
| id and version info | provides id and version of parent dataset | does not provide id and version of parent dataset (no concept of parent) |

@mslw
Collaborator

mslw commented May 11, 2023

I had similar thoughts @jsheunis (thanks for the great summary!) - and it seems that "iterworktree" and "itergittree" which were already proposed in separate issues could help with that too. For non-git directories, I imagine that we could also add a "directory-recursive" collection that would be based on Path.rglob() instead of Path.iterdir() used by "directory" (nb. Path.walk() is coming in Python 3.12).
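A minimal sketch of the hypothetical "directory-recursive" collection mentioned above, based on Path.rglob(). The name, the function, and the choice to report root-relative paths are assumptions, not an existing collection type:

```python
from pathlib import Path
from typing import Iterator


def iter_dir_recursive(root: Path) -> Iterator[Path]:
    """Like the 'directory' collection, but descend into subdirectories
    via Path.rglob(); report paths relative to the collection root."""
    for p in sorted(root.rglob('*')):
        yield p.relative_to(root)
```

On Python 3.12+ the same could be built on Path.walk(), which also exposes directory/file distinction per level.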

On the note of recursion -- seems pretty obvious, but while ls-file-collection on a nested directory is non-recursive (and the documentation says so), ls-file-collection on a tar file made from that directory will list all items, because of the tar file's nature (so everything is perfectly in order and in scope).

Overall, on the few small examples that I tried the command behaved predictably, and the code looks very modular which will be nice for any potential additions (and to be clear, mentions of "other collections" above were just general comments that need not & should not be added in this pr).

@mih
Member Author

mih commented May 11, 2023

Hey @jsheunis @mslw just briefly confirming that this kind of reuse is indeed not only in scope, but more or less the focus. However, I want to point out that ls-file-collection is primarily a convenience "driver" for the underlying iterator. As can be seen from the benchmark above, there is a substantial speed difference between the command API and the iterator. This is something that continued to bug me about how I wrote status and diff. So I would suspect that within-datalad reuse will focus on the iterator, not ls-file-collection (which is still useful for CLI scripting).

That being said, the iterators can be made more customizable, once it is clear how that would make sense. However, I would prefer and aim for multiple, clear, and focused implementations, rather than sophisticated Swiss-army knives.

Next up would be git-focused iterators.

They should make it possible to drop a substantial amount of custom code from gooey.

I did start thinking about annex-focused iterators, but I have not concluded on a concept yet (will file an issue once I know something). My current thinking is exploring the possibility of wrapping annex find.

@mih
Member Author

mih commented May 11, 2023

I will merge this now to enable further parallel incremental improvements.

@mih mih merged commit 93cb31f into datalad:main May 11, 2023
@mih mih deleted the lstcol branch May 11, 2023 17:09
@jsheunis
Member

Thanks for the explanation @mih, it gives me better context.

Successfully merging this pull request may close these issues.

str(CommandParametrizationError(...)) with with KeyError
Implement alternative to add-archive-content

3 participants