Introduce 'git backfill' to get missing blobs in a partial clone #5172

Merged
merged 6 commits into from
Sep 26, 2024

derrickstolee

This change introduces the `git backfill` command, which uses the path-walk API to download missing blobs in a blobless partial clone.

By downloading blobs that correspond to the same file path at the same time, we hope to maximize the potential benefits of delta compression against multiple versions.

These downloads occur in batches of a configurable size, which provides a mechanism for "resumable" clones: `git clone --filter=blob:none` gets the commits and trees, then `git backfill` downloads all missing blobs. If `git backfill` is interrupted partway through, it can be restarted and will download only the blobs that are still missing.

When combining blobless partial clones with sparse-checkout, `git backfill` will assume its `--sparse` option and download only the blobs within the sparse-checkout. Users may want to do this because the repo will still be smaller than a full clone, while commands like `git blame` or `git log -L` will not suffer from many one-by-one blob downloads.

Future directions should consider adding a pathspec or file prefix to further focus which paths are downloaded in a batch.
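
As a rough illustration of the resumable behavior described above, here is a standalone toy model in C. It is not code from this PR: `struct blob_entry`, `backfill_run()`, and the `budget` parameter (which stands in for an interruption) are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blob that may or may not be present locally. */
struct blob_entry {
	const char *oid;	/* placeholder object name */
	bool present;		/* true once the blob has been downloaded */
};

/*
 * Simulate one backfill run: "download" missing blobs in order until
 * 'budget' downloads have happened (modelling an interruption).
 * Returns the number of blobs downloaded during this run.
 */
static size_t backfill_run(struct blob_entry *blobs, size_t nr, size_t budget)
{
	size_t downloaded = 0;

	assert(blobs || !nr);
	for (size_t i = 0; i < nr && downloaded < budget; i++) {
		if (blobs[i].present)
			continue;	/* already local: nothing to request */
		blobs[i].present = true;
		downloaded++;
	}
	return downloaded;
}
```

Calling `backfill_run()` a second time on the same array only touches entries that are still missing, which is the property that makes a restart cheap.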

In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
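
To make the "ignore tree entries" step concrete, the following is a minimal sketch of a per-path callback. The callback shape, `enum sketch_obj_type`, and `struct sketch_backfill_ctx` are assumptions made for illustration; they are not the path-walk API's actual types or signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object kinds reported by a path-based walk. */
enum sketch_obj_type { SKETCH_OBJ_TREE, SKETCH_OBJ_BLOB };

/* Hypothetical accumulator for the blob OIDs that would be requested. */
struct sketch_backfill_ctx {
	size_t paths_with_blobs;
	size_t blobs_queued;
};

/*
 * Called once per path with every object ID that appears at that path.
 * Tree batches are skipped, on the assumption that a blobless clone
 * already has all reachable trees.
 */
static int sketch_fill_missing(const char *path, enum sketch_obj_type type,
			       const char *const *oids, size_t nr,
			       struct sketch_backfill_ctx *ctx)
{
	assert(path && ctx);
	(void)oids;	/* a real implementation would inspect each OID */
	if (type != SKETCH_OBJ_BLOB)
		return 0;
	ctx->paths_with_blobs++;
	ctx->blobs_queued += nr;
	return 0;
}
```
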
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
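
The "minimum, not maximum" semantics can be sketched as follows. This is a simplified model under the assumption that all blobs for one path are queued together and a request is sent once the queue reaches the threshold; `struct batch_state`, `add_path()`, and `flush_batch()` are hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batching state; names are illustrative only. */
struct batch_state {
	size_t min_batch_size;	/* flush threshold */
	size_t pending;		/* missing blobs queued for the next request */
	size_t requests;	/* number of simulated server requests */
	size_t largest;		/* size of the largest request sent so far */
};

/* Send whatever is queued as one simulated request. */
static void flush_batch(struct batch_state *st)
{
	if (!st->pending)
		return;
	st->requests++;
	if (st->pending > st->largest)
		st->largest = st->pending;
	st->pending = 0;
}

/*
 * Queue all 'nr_missing' blobs found at one path, then flush if the
 * threshold has been reached. The blobs of a single path are never
 * split, so a request may be larger than min_batch_size.
 */
static void add_path(struct batch_state *st, size_t nr_missing)
{
	assert(st && st->min_batch_size > 0);
	st->pending += nr_missing;
	if (st->pending >= st->min_batch_size)
		flush_batch(st);
}
```

With a threshold of 10, a path with 4 blobs followed by a path with 25 blobs yields a single request of 29 objects, which keeps every version of the second path in the same request. A final `flush_batch()` call after the walk sends any remainder.
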
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, but the sparse-checkout
also reduces the number of objects that must be downloaded from a promisor
remote.

However, history investigations can be expensive, as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and non-cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
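
The cone-mode pruning mentioned above can be illustrated with a deliberately simplified directory test. This sketch assumes the cone is given as a plain list of directory prefixes without trailing slashes; it does not use `struct pattern_list` or Git's real pattern matching, and `is_within()` and `should_walk_dir()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if 'dir' equals 'prefix' or lies underneath it. */
static bool is_within(const char *dir, const char *prefix)
{
	size_t n = strlen(prefix);

	return !strncmp(dir, prefix, n) && (dir[n] == '\0' || dir[n] == '/');
}

/*
 * Decide whether a walk should descend into 'dir'. A directory is
 * walked if it is inside a cone directory, or if it is an ancestor of
 * one (so the walk can still reach the cone). Anything else can be
 * skipped together with its whole subtree.
 */
static bool should_walk_dir(const char *dir, const char *const *cone,
			    size_t nr)
{
	assert(dir);
	if (!*dir)
		return true;	/* the root is always walked */
	for (size_t i = 0; i < nr; i++) {
		if (is_within(dir, cone[i]) || is_within(cone[i], dir))
			return true;
	}
	return false;
}
```

In non-cone mode, arbitrary patterns can match files anywhere in the tree, so this kind of early subtree pruning is not available in general.
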
The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not enable it by default. However, this is
likely the behavior that users will most often want. Without this default,
users with a small sparse-checkout may be confused when 'git backfill'
downloads every version of every object in the full history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
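
One way to picture the resulting option handling is a tri-state value that falls back to the repository's sparse-checkout setting when the user gives no flag. This is a sketch of the decision described in the commit message, not the builtin's actual code; `enum sparse_opt` and `resolve_sparse()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state for the '--[no-]sparse' option. */
enum sparse_opt { SPARSE_UNSET = -1, SPARSE_OFF = 0, SPARSE_ON = 1 };

/*
 * Returns 1 to restrict the backfill to the sparse-checkout, 0 to
 * download blobs for all paths, or -1 when '--sparse' was requested
 * in a repository that has no sparse-checkout.
 */
static int resolve_sparse(enum sparse_opt opt, bool sparse_checkout_enabled)
{
	assert(opt >= SPARSE_UNSET && opt <= SPARSE_ON);
	if (opt == SPARSE_UNSET)
		return sparse_checkout_enabled ? 1 : 0;
	if (opt == SPARSE_ON && !sparse_checkout_enabled)
		return -1;
	return opt == SPARSE_ON ? 1 : 0;
}
```
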
@derrickstolee derrickstolee self-assigned this Sep 25, 2024
#include "path-walk.h"

static const char * const builtin_backfill_usage[] = {
N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
Author

In before @dscho: this needs an (EXPERIMENTAL!) flag.

Member

See here: 3793d09 @derrickstolee!

This is a highly useful command, and we want it to get some testing "in
the wild". However, the patches have not yet been reviewed on the Git
mailing list, and are therefore subject to change. By marking the
command as experimental, users will be warned to pay attention to those
changes.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Member

@dscho dscho left a comment

Great stuff! Most of the patches I saw before, in one form or other, so there were no surprises for me.

I've added one more commit (to mark the command as experimental), and plan on merging this tomorrow (so that we get a new snapshot, with a brand-new MSYS2 runtime, no less!).

@derrickstolee
Author

Thanks, @dscho. One more big one to go!

@dscho dscho added this to the Next release milestone Sep 26, 2024
@dscho dscho merged commit 5e2e8b4 into git-for-windows:main Sep 26, 2024
44 checks passed
@dscho
Member

dscho commented Sep 26, 2024

/add relnote feature The new, experimental git backfill command was added: It helps fetching relevant Git objects smartly in a partial, sparse clone.

The workflow run was started

github-actions bot pushed a commit to git-for-windows/build-extra that referenced this pull request Sep 26, 2024
The new, experimental `git backfill` command [was
added](git-for-windows/git#5172): It helps
fetching relevant Git objects smartly in a partial, sparse clone.

Signed-off-by: gitforwindowshelper[bot] <gitforwindowshelper-bot@users.noreply.github.com>
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 27, 2024