iter_collections and ls-file-collection (#343)
Conversation
I lack the experience and understanding w.r.t. this functionality to comment specifically on the thought process or implementation, but just noting for the record that I ran a test locally and all seems to work nicely.
This is a start for consolidating common functionality scattered around various extensions into a single implementation (pattern). This changeset imports `iterdir()` from `datalad-gooey`. In contrast to the original implementation, this new one uses a stricter approach to types, and overfits less to a dataset-aware use case. However, it is not meant to be the exclusive implementation, but merely a start and a place to migrate directory iterators into. Ping datalad#323
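To illustrate the stricter-types idea (dataclasses plus enums instead of string-labeled result records), here is a minimal sketch of an `iterdir()`-style iterator. All names (`PathType`, `CollectionItem`, `iter_dir`) and the field selection are hypothetical illustrations, not the PR's actual API:

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Iterator


class PathType(Enum):
    # hypothetical enum: stricter than plain string labels
    file = 'file'
    directory = 'directory'
    symlink = 'symlink'


@dataclass
class CollectionItem:
    # hypothetical per-item report; field selection is illustrative
    name: Path
    type: PathType
    size: int


def iter_dir(path: Path) -> Iterator[CollectionItem]:
    """Yield one typed item per entry of a directory (non-recursive)."""
    for entry in path.iterdir():
        # check symlinks first: is_dir() would follow the link
        if entry.is_symlink():
            itype = PathType.symlink
        elif entry.is_dir():
            itype = PathType.directory
        else:
            itype = PathType.file
        yield CollectionItem(
            name=entry,
            type=itype,
            size=entry.lstat().st_size,
        )
```

A consumer can then match on `item.type` exhaustively, which is harder to get wrong than comparing arbitrary strings.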
This is trying to be structurally similar to the `directory` collection implementation.
Brief look at performance, comparing the envisioned CLI use for a TAR file (to get all infos needed to satisfy […]) with the conventional equivalent (actually not equivalent, but in practice it would be possible to quickly grab the same info). Test case is a 2.5GB […] Promising!
Only one path type enum, only one item dataclass (for now). Tests are good now; the remaining failure is httpbin accessibility.
Using a pre-crafted (300 byte) tarball that is hosted on GitHub, and is downloaded once per session.
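A tiny archive fixture like this can also be generated programmatically. The sketch below builds a minimal tarball with `tarfile`; the member names and layout are assumptions for illustration, not the actual fixture's content:

```python
import io
import tarfile


def make_sample_tarball(path):
    """Create a minimal tarball fixture with a known layout.

    Mimics a pre-crafted test archive in spirit only; the member
    names and layout here are illustrative assumptions.
    """
    with tarfile.open(path, 'w:gz') as tar:
        # one regular file with known content
        payload = b'123\n'
        finfo = tarfile.TarInfo('test-archive/onefile.txt')
        finfo.size = len(payload)
        tar.addfile(finfo, io.BytesIO(payload))
        # one (empty) subdirectory entry
        dinfo = tarfile.TarInfo('test-archive/subdir')
        dinfo.type = tarfile.DIRTYPE
        tar.addfile(dinfo)
```

In a test suite, a helper like this (or the download of the pre-crafted archive) would typically live behind a session-scoped `pytest.fixture`, so the archive is obtained only once per test session.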
Codecov report (patch coverage):

```
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   91.98%   92.22%   +0.23%
==========================================
  Files          86       93       +7
  Lines        7478     7730     +252
==========================================
+ Hits         6879     7129     +250
- Misses        599      601       +2
```

View full report in Codecov by Sentry.
Also include a check for these requirements that is executed in the error case (not performance-critical), to inform developers about obvious implementation issues. Closes datalad#348
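One way to realize such an error-path-only check is to run a requirements validation only after something already went wrong, so the common case pays no cost. The sketch below is a hypothetical illustration (function names, checked fields, and messages are all assumptions):

```python
def _check_item_requirements(item):
    """Developer-oriented sanity check for iterator items.

    Verifies hypothetical requirements an item must satisfy.  It is
    only invoked on the error path, so its cost does not matter.
    """
    problems = []
    if not hasattr(item, 'name'):
        problems.append('item has no "name" attribute')
    if getattr(item, 'size', -1) < 0:
        problems.append('item "size" is missing or negative')
    return problems


def process_item(item):
    """Consume an item; explain likely implementation issues on failure."""
    try:
        return f'{item.name}: {item.size} bytes'
    except AttributeError as e:
        # we are already failing, so take the time to diagnose
        problems = _check_item_requirements(item)
        raise TypeError(
            f'unsupported item {item!r}: {"; ".join(problems)}') from e
```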
This is (also) an alternative approach to `add-archive-content`. In comparison to the former, this is largely metadata-driven, and works without (local) extraction of a tarball. This saves storage overhead, and makes it possible to run some parts of the ingestion pipeline on a remote system. Closes datalad#183
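The metadata-driven idea can be sketched with the standard-library `tarfile` module: tar headers already carry name, size, mtime, and type, so a listing needs no extraction. The dataclass and function names below are illustrative, not the PR's `itertar()` API:

```python
import tarfile
from dataclasses import dataclass
from typing import Iterator


@dataclass
class TarItem:
    # minimal per-member report; field selection is illustrative
    name: str
    size: int
    mtime: int
    is_file: bool


def iter_tar(path) -> Iterator[TarItem]:
    """Yield metadata for each archive member without extracting payloads.

    Only the tar headers are read, which is a fraction of the I/O
    cost of a full extraction.
    """
    with tarfile.open(path) as tar:
        # iterating the TarFile streams member headers one by one
        for member in tar:
            yield TarItem(
                name=member.name,
                size=member.size,
                mtime=member.mtime,
                is_file=member.isfile(),
            )
```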
Few more bits on performance, this time on the […] With so many files, offset costs for argument validation etc. no longer matter much. Speed is about 1/3 of a plain […] Importantly, the actual iterator is on the same performance level as […]
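A comparison of this kind can be measured by timing a header-only pass against a pass that also reads every payload. This is a generic micro-benchmark sketch under my own assumptions, not the benchmark actually run above:

```python
import hashlib
import tarfile
import time


def count_members(path):
    """Header-only pass: counts members without reading any payload."""
    with tarfile.open(path) as tar:
        return sum(1 for _ in tar)


def hash_members(path):
    """Full pass: additionally reads and hashes every file member."""
    with tarfile.open(path) as tar:
        n = 0
        for member in tar:
            if member.isfile():
                hashlib.md5(tar.extractfile(member).read()).hexdigest()
            n += 1
        return n


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

For a large archive, the gap between the two timings shows how much of the cost is payload I/O rather than iteration itself.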
(This comment disappeared from GitHub when I posted it, so sorry if it's duplicated.) I just realised that […] Some current differences (for the context of catalog file metadata generation; and based on my limited experience with […]):
I had similar thoughts @jsheunis (thanks for the great summary!) -- and it seems that "iterworktree" and "itergittree", which were already proposed in separate issues, could help with that too. For non-git directories, I imagine that we could also add a "directory-recursive" collection that would be based on […] On the note of recursion -- seems pretty obvious, but while […] Overall, on the few small examples that I tried, the command behaved predictably, and the code looks very modular, which will be nice for any potential additions (and to be clear, mentions of "other collections" above were just general comments that need not & should not be added in this PR).
Hey @jsheunis @mslw, just briefly confirming that this kind of reuse is indeed not only in scope, but more or less the focus. However, I want to point out that […] That being said, the iterators can be made more customizable, once it is clear how that would make sense. However, I would prefer and aim for multiple, clear, and focused implementations, rather than sophisticated swiss-army-knives. Next up would be git-focused iterators. They should make it possible to drop a substantial amount of custom code from gooey. I did start thinking about annex-focused iterators, but I have not concluded on a concept yet (will file an issue once I know something). My current thinking is exploring the possibility of wrapping […]
I will merge this now to enable further parallel incremental improvements.
Thanks for the explanation @mih, it gives me better context. |
This PR is making `add-archive-content` (largely) obsolete. Closes #183

Summary

- `iter_collections` to establish a location in the spirit of "Consolidate 'traverser' from gooey and metalad into a common implementation" #323. The idea is to use dataclasses instead of result records, and enums instead of string labels to report on collection content. Iterators also support hashing (as a means to generate identifiers).
- `iterdir()` from `datalad-gooey` as an iterator for a directory as a collection (of directory content)
- `itertar()` for tar file archives -- reporting aligned with `iterdir()`
- `ls-file-collection` command that can drive these iterators.

This is somewhat in conflict with #342 due to suboptimal communication. A future discussion needs to sort out how the two approaches can be consolidated.
TODO:

The `hash` parameter of `ls-file-collection` should be an instruction what to compute, if there is no hash already. If so, an iterator would not only need to report a hash, but also the type of the hash it is reporting. It could be a sensible middle ground to have an iterator provide a zero-argument callable that provides a file-like that generic hash computation can use, if the controlling command deems it necessary (e.g. a requested hash is not readily provided).
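The zero-argument-callable idea could be sketched as follows: the iterator knows how to open its item (local file, tar member, remote stream), while a generic helper only streams chunks into `hashlib`. All names here are my own illustration, not the PR's API:

```python
import hashlib
from typing import BinaryIO, Callable, Dict, List


def compute_hashes(
    open_item: Callable[[], BinaryIO],
    algorithms: List[str],
    chunk_size: int = 65536,
) -> Dict[str, str]:
    """Generic, streaming hash computation over an item's file-like.

    ``open_item`` is the zero-argument callable discussed above; it
    hides how the collection item is actually opened.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open_item() as f:
        # feed each chunk to all requested digests in one pass
        while chunk := f.read(chunk_size):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

This keeps hashing policy (which digests, chunk size) in the controlling command, while the iterator only supplies access to the bytes.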
That being said, it is unlikely that a generic implementation of a read-from-some-file-like is going to be optimal. Considering the bandwidth of possibilities and necessities for `download` (i.e. anything remote; streaming, chunk size, etc.) alone cries for configurability. So maybe going down the path of `UrlOperations`, which can hash on the go, is a better approach. Maybe factor that out into a dedicated class, and simply make it easy to use.

For `add-archive-content` #183 we need an iterator for archives. We could have two: a simple one for local archives, and a more sophisticated one for remote archives (see "Draft for Issue 183 list collection" #342) -- but they would not necessarily differ much in the end, if hashing must be performed, rather than an existing hash being reported. https://doi.org/10.1038/s41597-022-01163-2
What `auto` would do could/would change over time. Conflict would always be there (even if I had a gitrepo worktree, I might still just need a directory listing). `auto` could maybe provide a bit of perceived convenience for interactive use, but that convenience would need to be bought with the added complexity of auto-detection, and of preserving its outcome long-term. Not worth it IMHO.

This seems to be related to the following doc snippet:
https://bugs.python.org/issue42957 (`StrEnum` -- only available from Python 3.11 onwards)

- `iter_collections` to 100%
- `ls-file-collection`
- `iter_collections`
- `ls-file-collection`. Include a demo on replacing `add-archive-content` (`test_replace_add_archive_content`)
Addressed issues

- `str(CommandParametrizationError(...))` with `KeyError` #348
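Regarding the `StrEnum` limitation mentioned in the TODO (only available from Python 3.11 onwards): a common pre-3.11 stand-in is to mix `str` into `Enum`. The member names below are only illustrative:

```python
from enum import Enum


class PathType(str, Enum):
    """Pre-3.11 stand-in for ``enum.StrEnum``.

    Mixing in ``str`` makes members compare equal to their value and
    usable wherever a str is expected (e.g. JSON serialization),
    which is what ``StrEnum`` provides out of the box in 3.11+.
    """
    file = 'file'
    directory = 'directory'
    symlink = 'symlink'
```

One caveat: the exact `str()`/`format()` behavior of mixed-in enums differs subtly across Python versions, so code should rely on `.value` or equality comparisons rather than string conversion.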