Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iter_gitworktree() for processing work tree content #361

Merged
merged 3 commits into from
Jun 5, 2023

Conversation

mih
Copy link
Member

@mih mih commented May 17, 2023

The iterator is also integrated with ls-file-collection as collection type gitworktree.

The integration in ls-file-collection does not expose all features. For some additional insight run something like

python -c 'from pathlib import Path; from datalad_next.iter_collections.gitworktree import iter_gitworktree; from pprint import pprint; pprint(list(iter_gitworktree(Path("."), include_untracked=None, link_target=True)))'

Closes #350
Ping #323

TODO

@jsheunis
Copy link
Member

Great work. I played with it a bit, and it was nice and smooth.

A couple of comments and questions:

  • it's really fast!
  • what is the reasoning behind incorporating fewer options into the call via CLI?
  • IIUC, the custom results renderer of ls-file-collection does not render any information in the expected style of ls -l (permissions, mtime, etc) when the collection type is gitworktree, because this information is not provided by the iterator. Would it make sense (or be possible?) to use this renderer only for other collection types that do? Without this information, the rendering looks a bit strange (my example pasted below), and I would suspect especially to those who don't know what to expect.
datalad ls-file-collection gitworktree .
?---------      -                       .appveyor.yml (file)
?---------      -                       .codeclimate.yml (file)
?---------      -                       .codespellrc (file)
?---------      -                       .coveragerc (file)
?---------      -                       .gitattributes (file)
?---------      -                       .github/PULL_REQUEST_TEMPLATE.md (file)
?---------      -                       .github/workflows/codespell.yml (file)
?---------      -                       .github/workflows/docbuild.yml (file)
?---------      -                       .github/workflows/test_crippledfs.yml (file)
?---------      -                       .gitignore (file)
?---------      -                       .noannex (file)
?---------      -                       .zenodo.json (file)
?---------      -                       CHANGELOG.md (file)
?---------      -                       CITATION.cff (file)
?---------      -                       CONTRIBUTING.md (file)
?---------      -                       CONTRIBUTORS (file)
?---------      -                       LICENSE (file)
?---------      -                       MANIFEST.in (file)
?---------      -                       Makefile (file)
?---------      -                       README.md (file)
?---------      -                       _datalad_buildsupport/__init__.py (file)
?---------      -                       _datalad_buildsupport/formatters.py (file)
?---------      -                       _datalad_buildsupport/setup.py (file)
?---------      -                       changelog.d/20230120_083014_michael.hanke_patchupdate.md (file)
?---------      -                       changelog.d/20230324_122025_michael.hanke_deprecate_decorator.md (file)
?---------      -                       changelog.d/20230503_133855_michael.hanke_progress.md (file)
?---------      -                       changelog.d/20230505_074522_michael.hanke_httphdrs.md (file)
?---------      -                       changelog.d/20230508_072605_michael.hanke_cerror.md (file)
?---------      -                       changelog.d/20230508_211747_michael.hanke_multihash.md (file)
?---------      -                       changelog.d/README (file)
?---------      -                       changelog.d/scriv.ini (file)
?---------      -                       changelog.d/templates/entry_title.md.j2 (file)
?---------      -                       changelog.d/templates/new_fragment.md.j2 (file)
?---------      -                       datalad_next/__init__.py (file)
?---------      -                       datalad_next/_version.py (file)
?---------      -                       datalad_next/annexbackends/__init__.py (file)
?---------      -                       datalad_next/annexbackends/base.py (file)
?---------      -                       datalad_next/annexbackends/tests/__init__.py (file)
?---------      -                       datalad_next/annexbackends/tests/test_base.py (file)
?---------      -                       datalad_next/annexbackends/xdlra.py (file)
?---------      -                       datalad_next/annexremotes/__init__.py (file)
?---------      -                       datalad_next/annexremotes/tests/__init__.py (file)
?---------      -                       datalad_next/annexremotes/tests/test_uncurl.py (file)
?---------      -                       datalad_next/annexremotes/uncurl.py (file)
?---------      -                       datalad_next/commands/__init__.py (file)
?---------      -                       datalad_next/commands/create_sibling_webdav.py (file)
?---------      -                       datalad_next/commands/credentials.py (file)
?---------      -                       datalad_next/commands/download.py (file)
?---------      -                       datalad_next/commands/ls_file_collection.py (file)
?---------      -                       datalad_next/commands/tests/__init__.py (file)
?---------      -                       datalad_next/commands/tests/test_create_sibling_webdav.py (file)
?---------      -                       datalad_next/commands/tests/test_credentials.py (file)
?---------      -                       datalad_next/commands/tests/test_download.py (file)
?---------      -                       datalad_next/commands/tests/test_ls_file_collection.py (file)
?---------      -                       datalad_next/commands/tests/test_tree.py (file)
?---------      -                       datalad_next/commands/tree.py (file)
?---------      -                       datalad_next/conftest.py (file)
?---------      -                       datalad_next/constraints/__init__.py (file)
?---------      -                       datalad_next/constraints/base.py (file)
?---------      -                       datalad_next/constraints/basic.py (file)
?---------      -                       datalad_next/constraints/compound.py (file)
?---------      -                       datalad_next/constraints/dataset.py (file)
?---------      -                       datalad_next/constraints/exceptions.py (file)
?---------      -                       datalad_next/constraints/formats.py (file)
?---------      -                       datalad_next/constraints/git.py (file)
?---------      -                       datalad_next/constraints/parameter.py (file)
?---------      -                       datalad_next/constraints/parameter_legacy.py (file)
?---------      -                       datalad_next/constraints/tests/__init__.py (file)
?---------      -                       datalad_next/constraints/tests/test_base.py (file)
?---------      -                       datalad_next/constraints/tests/test_basic.py (file)
?---------      -                       datalad_next/constraints/tests/test_cmdarg_validation.py (file)
?---------      -                       datalad_next/constraints/tests/test_compound.py (file)
?---------      -                       datalad_next/constraints/tests/test_exceptions.py (file)
?---------      -                       datalad_next/constraints/tests/test_special_purpose.py (file)
?---------      -                       datalad_next/constraints/tests/test_tutorial.py (file)
?---------      -                       datalad_next/constraints/utils.py (file)
?---------      -                       datalad_next/credman/__init__.py (file)
?---------      -                       datalad_next/credman/manager.py (file)
?---------      -                       datalad_next/credman/tests/__init__.py (file)
?---------      -                       datalad_next/credman/tests/test_credman.py (file)
?---------      -                       datalad_next/datasets/__init__.py (file)
?---------      -                       datalad_next/exceptions/__init__.py (file)
?---------      -                       datalad_next/gitremotes/__init__.py (file)
?---------      -                       datalad_next/gitremotes/datalad_annex.py (file)
?---------      -                       datalad_next/gitremotes/tests/__init__.py (file)
?---------      -                       datalad_next/gitremotes/tests/test_datalad_annex.py (file)
?---------      -                       datalad_next/iter_collections/__init__.py (file)
?---------      -                       datalad_next/iter_collections/directory.py (file)
?---------      -                       datalad_next/iter_collections/gitworktree.py (file)
?---------      -                       datalad_next/iter_collections/tarfile.py (file)
?---------      -                       datalad_next/iter_collections/tests/__init__.py (file)
?---------      -                       datalad_next/iter_collections/tests/test_iterdir.py (file)
?---------      -                       datalad_next/iter_collections/tests/test_itertar.py (file)
?---------      -                       datalad_next/iter_collections/utils.py (file)
?---------      -                       datalad_next/patches/__init__.py (file)
?---------      -                       datalad_next/patches/annexrepo.py (file)
?---------      -                       datalad_next/patches/common_cfg.py (file)
?---------      -                       datalad_next/patches/configuration.py (file)
?---------      -                       datalad_next/patches/create_sibling_ghlike.py (file)
?---------      -                       datalad_next/patches/customremotes_main.py (file)
?---------      -                       datalad_next/patches/distribution_dataset.py (file)
?---------      -                       datalad_next/patches/interface_utils.py (file)
?---------      -                       datalad_next/patches/push_optimize.py (file)
?---------      -                       datalad_next/patches/push_to_export_remote.py (file)
?---------      -                       datalad_next/patches/siblings.py (file)
?---------      -                       datalad_next/patches/test_keyring.py (file)
?---------      -                       datalad_next/patches/tests/__init__.py (file)
?---------      -                       datalad_next/patches/tests/test_annex_progress_logging.py (file)
?---------      -                       datalad_next/patches/tests/test_configuration.py (file)
?---------      -                       datalad_next/patches/tests/test_create_sibling_ghlike.py (file)
?---------      -                       datalad_next/patches/tests/test_push.py (file)
?---------      -                       datalad_next/patches/tests/test_push_to_export_remote.py (file)
?---------      -                       datalad_next/runners/__init__.py (file)
?---------      -                       datalad_next/runners/protocols.py (file)
?---------      -                       datalad_next/tests/__init__.py (file)
?---------      -                       datalad_next/tests/fixtures.py (file)
?---------      -                       datalad_next/tests/test_common_cfg.py (file)
?---------      -                       datalad_next/tests/test_register.py (file)
?---------      -                       datalad_next/tests/test_testutils.py (file)
?---------      -                       datalad_next/tests/utils.py (file)
?---------      -                       datalad_next/uis/__init__.py (file)
?---------      -                       datalad_next/url_operations/__init__.py (file)
?---------      -                       datalad_next/url_operations/any.py (file)
?---------      -                       datalad_next/url_operations/file.py (file)
?---------      -                       datalad_next/url_operations/http.py (file)
?---------      -                       datalad_next/url_operations/ssh.py (file)
?---------      -                       datalad_next/url_operations/tests/__init__.py (file)
?---------      -                       datalad_next/url_operations/tests/test_any.py (file)
?---------      -                       datalad_next/url_operations/tests/test_file.py (file)
?---------      -                       datalad_next/url_operations/tests/test_http.py (file)
?---------      -                       datalad_next/url_operations/tests/test_ssh.py (file)
?---------      -                       datalad_next/utils/__init__.py (file)
?---------      -                       datalad_next/utils/consts.py (file)
?---------      -                       datalad_next/utils/credman.py (file)
?---------      -                       datalad_next/utils/deprecate.py (file)
?---------      -                       datalad_next/utils/http_helpers.py (file)
?---------      -                       datalad_next/utils/log.py (file)
?---------      -                       datalad_next/utils/multihash.py (file)
?---------      -                       datalad_next/utils/patch.py (file)
?---------      -                       datalad_next/utils/requests_auth.py (file)
?---------      -                       datalad_next/utils/tests/__init__.py (file)
?---------      -                       datalad_next/utils/tests/test_deprecated.py (file)
?---------      -                       datalad_next/utils/tests/test_multihash.py (file)
?---------      -                       docs/CODEOWNERS (file)
?---------      -                       docs/Makefile (file)
?---------      -                       docs/README.md (file)
?---------      -                       docs/policy/release-management.md (file)
?---------      -                       docs/source/_static/datalad_logo.png (file)
?---------      -                       docs/source/_templates/autosummary/class.rst (file)
?---------      -                       docs/source/_templates/autosummary/module.rst (file)
?---------      -                       docs/source/annex-backends.rst (file)
?---------      -                       docs/source/annex-specialremotes.rst (file)
?---------      -                       docs/source/api.rst (file)
?---------      -                       docs/source/cmd.rst (file)
?---------      -                       docs/source/conf.py (file)
?---------      -                       docs/source/developer_guide/constraints.rst (file)
?---------      -                       docs/source/developer_guide/contributing.rst (file)
?---------      -                       docs/source/developer_guide/index.rst (file)
?---------      -                       docs/source/git-remote-helpers.rst (file)
?---------      -                       docs/source/index.rst (file)
?---------      -                       docs/source/patches.rst (file)
?---------      -                       docs/source/pyutils.rst (file)
?---------      -                       pyproject.toml (file)
?---------      -                       readthedocs.yml (file)
?---------      -                       requirements-devel.txt (file)
?---------      -                       setup.cfg (file)
?---------      -                       setup.py (file)
?---------      -                       tools/appveyor/docker-load-httpbin (file)
?---------      -                       tools/appveyor/enable-ssh-login (file)
?---------      -                       tools/appveyor/env_setup.bat (file)
?---------      -                       tools/appveyor/install-git-annex (file)
?---------      -                       tools/appveyor/install-syspkgs (file)
?---------      -                       tools/appveyor/submit-coverage (file)
?---------      -                       tools/appveyor/submit-coverage.bat (file)
?---------      -                       tools/coverage-bin/datalad (symlink)
?---------      -                       tools/coverage-bin/git-annex-backend-XDLRA (symlink)
?---------      -                       tools/coverage-bin/git-annex-remote-datalad (symlink)
?---------      -                       tools/coverage-bin/git-annex-remote-datalad-archives (symlink)
?---------      -                       tools/coverage-bin/git-annex-remote-ora (symlink)
?---------      -                       tools/coverage-bin/git-annex-remote-uncurl (symlink)
?---------      -                       tools/coverage-bin/git-remote-datalad-annex (symlink)
?---------      -                       tools/coverage-bin/sitecustomize.py (file)
?---------      -                       tools/coverage-bin/with_coverage (file)
?---------      -                       tools/setup-git-identity (file)
?---------      -                       versioneer.py (file)

@mih
Copy link
Member Author

mih commented May 18, 2023

re limited options:

Right now it is too early IMHO to define a sophisticated interface for this command. Both necessary and desirable functionality are undetermined still. I believe we need more iterators and use cases to get a better idea first.

re result renderer:

Absolutely possible, but no priority. Also, run the same command with --hash and the renderer becomes more useful without any other change.

Long term, we could have some decision making on any incoming result to select the most appropriate renderer. However, any real-world use of this command that I can see would use the json renderer anyways. So this is more about eye candy than functionality IMHO.

@jsheunis
Copy link
Member

Thanks, that all seems sensible.

jsheunis
jsheunis previously approved these changes May 23, 2023
mih added a commit to mih/datalad-next that referenced this pull request Jun 1, 2023
Originally introduced in
datalad#361, but a need for this
has also been expressed elsewhere, hence now a standalone changeset.
@codecov
Copy link

codecov bot commented Jun 5, 2023

Codecov Report

Patch coverage: 86.91% and project coverage change: -0.07 ⚠️

Comparison is base (b690ca7) 92.51% compared to head (32c5810) 92.45%.

❗ Current head 32c5810 differs from pull request most recent head 758ca41. Consider uploading reports for the commit 758ca41 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
- Coverage   92.51%   92.45%   -0.07%     
==========================================
  Files         117      119       +2     
  Lines        8468     8572     +104     
==========================================
+ Hits         7834     7925      +91     
- Misses        634      647      +13     
Impacted Files Coverage Δ
datalad_next/iter_collections/utils.py 98.18% <ø> (ø)
datalad_next/commands/ls_file_collection.py 86.53% <39.13%> (-13.47%) ⬇️
datalad_next/iter_collections/gitworktree.py 100.00% <100.00%> (ø)
...ext/iter_collections/tests/test_itergitworktree.py 100.00% <100.00%> (ø)
datalad_next/runners/__init__.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mih mih force-pushed the iterworktree branch 2 times, most recently from 32c5810 to 0b16d98 Compare June 5, 2023 10:07
The iterator is also integrated with `ls-file-collection` as collection
type `gitworktree`.

Closes datalad#350
Ping datalad#323
Copy link
Collaborator

@mslw mslw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a confirmation that it works for me than a review - here's my attempt on a worktree glob by file name extension:

pprint(
    [
        x
        for x in iter_gitworktree(ds.pathobj, untracked=None, link_target=True)
        if x.name.suffix == ".dat"
    ]
)

gives:

[GitWorktreeFileSystemItem(type=<FileSystemItemType.symlink: 'symlink'>, name=PurePosixPath('bar/file2.dat'), size=119, mtime=1685962565.9671404, mode=41471, uid=1001, gid=1001, link_target=PurePosixPath('../.git/annex/objects/FZ/8w/MD5E-s6--6f41758e57d7958213d129349ac20c8e.dat/MD5E-s6--6f41758e57d7958213d129349ac20c8e.dat'), fp=None, gitsha='ad2f7c351c5afd14cc29a6150752fa5e898d85f1', gittype=<GitTreeItemType.symlink: 'symlink'>),
 GitWorktreeFileSystemItem(type=<FileSystemItemType.symlink: 'symlink'>, name=PurePosixPath('file1.dat'), size=116, mtime=1685962553.4789724, mode=41471, uid=1001, gid=1001, link_target=PurePosixPath('.git/annex/objects/F1/gz/MD5E-s6--f2b33fb7b3d0eb95090a16060e6a24f9.dat/MD5E-s6--f2b33fb7b3d0eb95090a16060e6a24f9.dat'), fp=None, gitsha='455a9f4d07fe04aa39c2f4b87a15435a0e050a66', gittype=<GitTreeItemType.symlink: 'symlink'>)]

and can replace a low-level call to ls-files (that I see is the base for iter_gitworktree), while including more info (size, mtime) than the names alone.

pprint([PurePath(x) for x in ds.repo.call_git_items_(["ls-files", "*.dat"])])

christian-monch
christian-monch previously approved these changes Jun 5, 2023
Copy link
Contributor

@christian-monch christian-monch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I only have a few smaller remarks. [Edited]: The windows error seems genuine though

for content in runner.run():
# for each zerobyte-delimited "line" in the output
for line in line_splitter.process(content.decode('utf-8')):
yield line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chould run line_splitter.finish_processing(), after the content-loop, to verify that no unterminated content was sent by git ls-files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest a code change. I am not much familiar with this functionality.

datalad_next/iter_collections/gitworktree.py Show resolved Hide resolved
datalad_next/iter_collections/gitworktree.py Outdated Show resolved Hide resolved
datalad_next/iter_collections/gitworktree.py Outdated Show resolved Hide resolved
@christian-monch
Copy link
Contributor

It seems like the windows error is genuine

@mih
Copy link
Member Author

mih commented Jun 5, 2023

It seems like the windows error is genuine

Did you look into it? From the log I cannot tell.

Copy link
Contributor

@christian-monch christian-monch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are failing on Windows because of comparison between PurePosixPath- and PureWindowsPath-instances.

I suggested a few patches to fix the tests. Alternative to the patches below we might think about changing the type of GitWorktreeItem.name to PurePath.

@mih
Copy link
Member Author

mih commented Jun 5, 2023

OK, thanks! So this really is #402

You patches seem to prefer the direction that leaves it to the particular iterator which path type to use. However, this would make iter_gitworktree return not only items of different type depending on parameters, but even the type of their names switch.

I am leaning towards either making everything PurePath, also here, or everything PurePosixPath, also in all other iterators.

@christian-monch
Copy link
Contributor

Generally, PurePath-instances are system-dependent. From the python doc:

| [PurePath is a] generic class that represents the system’s path flavour (instantiating it creates either a PurePosixPath
| or a PureWindowsPath [...]

@christian-monch
Copy link
Contributor

OK, thanks! So this really is #402

You patches seem to prefer the direction that leaves it to the particular iterator which path type to use. However, this would make iter_gitworktree return not only items of different type depending on parameters, but even the type of their names switch.

I am leaning towards either making everything PurePath, also here, or everything PurePosixPath, also in all other iterators.

I think it is worth a try and we might have to see where it leads us...

@mih
Copy link
Member Author

mih commented Jun 5, 2023

I implemented that change. Resolving related conversations now.

@mih
Copy link
Member Author

mih commented Jun 5, 2023

Test failure is all-familiar httpbin. Otherwise all splendid! Thanks for the reviews, this is going in.

@mih mih merged commit b4584db into datalad:main Jun 5, 2023
@mih mih deleted the iterworktree branch June 5, 2023 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement iterworktree() for ls-file-collection
4 participants