[WORK-IN-PROGRESS] Introduce the path walk API into Git for Windows #5146

Closed
wants to merge 45 commits into from
Changes from 8 commits
Commits
6354d7a
path-walk: introduce an object walk by path
derrickstolee Aug 29, 2024
c8e08c3
backfill: add builtin boilerplate
derrickstolee Jun 7, 2024
b05a276
backfill: basic functionality and tests
derrickstolee Sep 1, 2024
e02f7b3
backfill: add --batch-size=<n> option
derrickstolee Sep 1, 2024
4236e4f
backfill: add --sparse option
derrickstolee Sep 1, 2024
31c9b45
path-walk: allow consumer to specify object types
derrickstolee Sep 1, 2024
356abc9
backfill: assume --sparse when sparse-checkout is enabled
derrickstolee Sep 1, 2024
3a421ff
path-walk: allow visiting tags
derrickstolee Sep 9, 2024
b9471b6
survey: stub in new experimental `git-survey` command
jeffhostetler Apr 29, 2024
c4b3490
survey: add command line opts to select references
jeffhostetler Apr 29, 2024
ca37a49
survey: collect the set of requested refs
jeffhostetler Apr 29, 2024
0a20a17
survey: start pretty printing data in table form
derrickstolee Sep 1, 2024
91c4d57
survey: add object count summary
derrickstolee Sep 2, 2024
53632be
revision: create mark_trees_uninteresting_dense()
derrickstolee Sep 6, 2024
c63928e
survey: summarize total sizes by object type
derrickstolee Sep 2, 2024
3e9b671
path-walk: add prune_all_uninteresting option
derrickstolee Sep 4, 2024
af7d53f
survey: show progress during object walk
derrickstolee Sep 2, 2024
d192ae7
pack-objects: add --path-walk option
derrickstolee Sep 5, 2024
5f7e131
survey: add ability to track prioritized lists
derrickstolee Sep 2, 2024
ab0bc08
pack-objects: extract should_attempt_deltas()
derrickstolee Sep 6, 2024
bd8b5b5
survey: add report of "largest" paths
derrickstolee Sep 2, 2024
c6d4832
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
derrickstolee Sep 6, 2024
c2092f0
p5313: add size comparison test
derrickstolee Aug 28, 2024
bbc57f7
repack: add --path-walk option
derrickstolee Sep 5, 2024
32fca07
pack-objects: enable --path-walk via config
derrickstolee Sep 5, 2024
c145b9e
pack-objects: add --full-name-hash option
derrickstolee Sep 7, 2024
72191a0
test-name-hash: add helper to compute name-hash functions
derrickstolee Sep 8, 2024
5039f03
p5314: add a size test for name-hash collisions
derrickstolee Sep 9, 2024
e43582c
scalar: enable path-walk during push via config
derrickstolee Sep 5, 2024
88fee5b
pack-objects: output debug info about deltas
derrickstolee Aug 28, 2024
d17e503
Merge branch 'backfill'
dscho Sep 15, 2024
d7e7283
Merge branch 'survey'
dscho Sep 15, 2024
98a5786
Merge branch 'pack-path-walk'
dscho Sep 15, 2024
9d0690a
Merge branch 'path-walk'
dscho Sep 15, 2024
556335a
fixup! survey: collect the set of requested refs
dscho Sep 15, 2024
69aa8d8
fixup! pack-objects: output debug info about deltas
dscho Sep 15, 2024
5001883
fixup! survey: summarize total sizes by object type
dscho Sep 15, 2024
3ab1bda
fixup! survey: add report of "largest" paths
dscho Sep 15, 2024
84c8a06
fixup! survey: summarize total sizes by object type
dscho Sep 15, 2024
16cd9a3
fixup! pack-objects: output debug info about deltas
dscho Sep 15, 2024
c8f1239
fixup! survey: start pretty printing data in table form
dscho Sep 15, 2024
b5c2265
fixup! survey: add object count summary
dscho Sep 15, 2024
fee8f88
fixup! survey: summarize total sizes by object type
dscho Sep 15, 2024
489ce0c
test-tool: add the `path-walk` subcommand
dscho Sep 17, 2024
9b78d40
fixup! test-tool: add the `path-walk` subcommand
dscho Sep 17, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@
/git-apply
/git-archimport
/git-archive
/git-backfill
/git-bisect
/git-blame
/git-branch
60 changes: 60 additions & 0 deletions Documentation/git-backfill.txt
@@ -0,0 +1,60 @@
git-backfill(1)
===============

NAME
----
git-backfill - Download missing objects in a partial clone


SYNOPSIS
--------
[verse]
'git backfill' [--batch-size=<n>] [--[no-]sparse]

DESCRIPTION
-----------

Blobless partial clones are created using `git clone --filter=blob:none`,
which configures the local repository so that the Git client avoids
downloading blob objects unless they are required for a local operation.
Initially, this means that the clone and later fetches download reachable
commits and trees but no blobs. Later operations that change the `HEAD`
pointer, such as `git checkout` or `git merge`, may need to download
missing blobs in order to complete.

In the worst cases, commands that compute blob diffs, such as `git blame`,
become very slow because they download each missing blob in its own
request, at the moment the command needs it. This leads to many download
requests and gives the Git server no opportunity to provide delta
compression across those objects.

The `git backfill` command lets the user request that Git download the
missing blobs (with optional filters) so that the blobs representing
historical versions of files are fetched in batches. The `backfill`
command attempts to optimize the request by grouping blobs that appear
at the same path, hopefully leading to good delta compression in the
packfile sent by the server.

By default, `git backfill` downloads all blobs reachable from the `HEAD`
commit. This set can be restricted or expanded using various options.

OPTIONS
-------

--batch-size=<n>::
Specify a minimum size for a batch of missing objects to request
from the server. This size may be exceeded by the last set of
blobs seen at a given path. Default batch size is 16,000.

--[no-]sparse::
Only download objects if they appear at a path that matches the
current sparse-checkout. If the sparse-checkout feature is enabled,
then `--sparse` is assumed and can be disabled with `--no-sparse`.

SEE ALSO
--------
linkgit:git-clone[1].

GIT
---
Part of the linkgit:git[1] suite
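
The `--batch-size` description above says the threshold is a minimum that the last set of blobs at a path may exceed. A minimal sketch of that behavior follows, assuming the size check runs once per path after all of that path's missing blobs are queued. The names `batch_sketch`, `add_path`, `flush_batch`, and `simulate` are hypothetical and only count objects; they are not part of the patch. Unlike the patch, the sketch also skips an empty final request.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the batching state. The real command
 * stores object IDs; this sketch only counts them.
 */
struct batch_sketch {
	size_t pending;  /* objects queued but not yet requested */
	size_t min_size; /* threshold, as in --batch-size=<n> */
	size_t requests; /* number of simulated download requests */
	size_t largest;  /* largest single request so far */
};

/* Simulate one download request for everything queued. */
void flush_batch(struct batch_sketch *b)
{
	if (!b->pending)
		return;
	b->requests++;
	if (b->pending > b->largest)
		b->largest = b->pending;
	b->pending = 0;
}

/*
 * Queue all missing blobs seen at one path, then flush only if the
 * threshold has been reached. The check happens once per path, so a
 * request can be larger than min_size.
 */
void add_path(struct batch_sketch *b, size_t missing_at_path)
{
	b->pending += missing_at_path;
	if (b->pending >= b->min_size)
		flush_batch(b);
}

/* Run the whole simulation and return the number of requests. */
size_t simulate(struct batch_sketch *b, const size_t *per_path, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		add_path(b, per_path[i]);
	flush_batch(b); /* leftover objects that never filled a batch */
	return b->requests;
}
```

With a threshold of 10 and paths contributing 4, 5, 7, and 3 missing blobs, the first request carries 16 objects (the threshold is crossed while handling the third path) and the remaining 3 go out in a final request.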
2 changes: 2 additions & 0 deletions Makefile
@@ -1097,6 +1097,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
@@ -1204,6 +1205,7 @@ BUILTIN_OBJS += builtin/am.o
BUILTIN_OBJS += builtin/annotate.o
BUILTIN_OBJS += builtin/apply.o
BUILTIN_OBJS += builtin/archive.o
BUILTIN_OBJS += builtin/backfill.o
BUILTIN_OBJS += builtin/bisect.o
BUILTIN_OBJS += builtin/blame.o
BUILTIN_OBJS += builtin/branch.o
1 change: 1 addition & 0 deletions builtin.h
@@ -127,6 +127,7 @@ int cmd_am(int argc, const char **argv, const char *prefix);
int cmd_annotate(int argc, const char **argv, const char *prefix);
int cmd_apply(int argc, const char **argv, const char *prefix);
int cmd_archive(int argc, const char **argv, const char *prefix);
int cmd_backfill(int argc, const char **argv, const char *prefix);
int cmd_bisect(int argc, const char **argv, const char *prefix);
int cmd_blame(int argc, const char **argv, const char *prefix);
int cmd_branch(int argc, const char **argv, const char *prefix);
141 changes: 141 additions & 0 deletions builtin/backfill.c
@@ -0,0 +1,141 @@
#include "builtin.h"
#include "git-compat-util.h"
#include "config.h"
#include "parse-options.h"
#include "repository.h"
#include "commit.h"
#include "dir.h"
#include "environment.h"
#include "hex.h"
#include "tree.h"
#include "tree-walk.h"
#include "object.h"
#include "object-store-ll.h"
#include "oid-array.h"
#include "oidset.h"
#include "promisor-remote.h"
#include "strmap.h"
#include "string-list.h"
#include "revision.h"
#include "trace2.h"
#include "progress.h"
#include "packfile.h"
#include "path-walk.h"

static const char * const builtin_backfill_usage[] = {
N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
NULL
};

struct backfill_context {
struct repository *repo;
struct oid_array current_batch;
size_t batch_size;
int sparse;
};

static void clear_backfill_context(struct backfill_context *ctx)
{
oid_array_clear(&ctx->current_batch);
}

static void download_batch(struct backfill_context *ctx)
{
promisor_remote_get_direct(ctx->repo,
ctx->current_batch.oid,
ctx->current_batch.nr);
oid_array_clear(&ctx->current_batch);

/*
* We likely have a new packfile. Add it to the packed list to
* avoid possible duplicate downloads of the same objects.
*/
reprepare_packed_git(ctx->repo);
}

static int fill_missing_blobs(const char *path,
struct oid_array *list,
enum object_type type,
void *data)
{
struct backfill_context *ctx = data;

if (type != OBJ_BLOB)
BUG("fill_missing_blobs only takes blob objects");

for (size_t i = 0; i < list->nr; i++) {
off_t size = 0;
struct object_info info = OBJECT_INFO_INIT;
info.disk_sizep = &size;
if (oid_object_info_extended(ctx->repo,
&list->oid[i],
&info,
OBJECT_INFO_FOR_PREFETCH) ||
!size)
oid_array_append(&ctx->current_batch, &list->oid[i]);
}

if (ctx->current_batch.nr >= ctx->batch_size)
download_batch(ctx);

return 0;
}

static int do_backfill(struct backfill_context *ctx)
{
struct rev_info revs;
struct path_walk_info info = PATH_WALK_INFO_INIT;
int ret;

if (ctx->sparse) {
CALLOC_ARRAY(info.pl, 1);

Review comment (Member, Author): Do we have to release the allocated memory somewhere?

if (get_sparse_checkout_patterns(info.pl))
return error(_("problem loading sparse-checkout"));
}

repo_init_revisions(ctx->repo, &revs, "");
handle_revision_arg("HEAD", &revs, 0, 0);

info.revs = &revs;
info.path_fn = fill_missing_blobs;
info.path_fn_data = ctx;

ret = walk_objects_by_path(&info);

/* Download the objects that did not fill a batch. */
if (!ret)
download_batch(ctx);

clear_backfill_context(ctx);
return ret;
}
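
The review question about releasing the `CALLOC_ARRAY` allocation can be illustrated with a common C cleanup pattern. This is only a sketch under assumptions: it uses plain `calloc`/`free`, and `patterns_sketch`, `load_patterns`, and `walk_with_patterns` are hypothetical names. Whether the real code needs this depends on who owns `info.pl`, which this excerpt does not show.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a heap-allocated pattern list. */
struct patterns_sketch {
	int nr;
};

/* Hypothetical loader; returns nonzero on failure. */
int load_patterns(struct patterns_sketch *pl, int should_fail)
{
	if (should_fail)
		return -1;
	pl->nr = 1;
	return 0;
}

/*
 * Allocate, use, and release on every exit path. Returns 0 on success
 * and -1 on failure; *freed counts how often the list was released.
 */
int walk_with_patterns(int use_sparse, int should_fail, int *freed)
{
	struct patterns_sketch *pl = NULL;
	int ret = 0;

	if (use_sparse) {
		pl = calloc(1, sizeof(*pl));
		if (!pl)
			return -1;
		if (load_patterns(pl, should_fail)) {
			ret = -1;
			goto cleanup; /* do not return early and leak pl */
		}
	}

	/* ... the walk itself would run here ... */

cleanup:
	if (pl) {
		free(pl);
		(*freed)++;
	}
	return ret;
}
```

Routing both the error path and the success path through one `cleanup` label keeps the allocation and its release in the same function, so an early `return` cannot skip the `free`.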

int cmd_backfill(int argc, const char **argv, const char *prefix)
{
struct backfill_context ctx = {
.repo = the_repository,
.current_batch = OID_ARRAY_INIT,
.batch_size = 16000,
.sparse = -1, /* unset; resolved from config after option parsing */
};
struct option options[] = {
OPT_INTEGER(0, "batch-size", &ctx.batch_size,
N_("Minimum number of objects to request at a time")),
OPT_BOOL(0, "sparse", &ctx.sparse,
N_("Restrict the missing objects to the current sparse-checkout")),
OPT_END(),
};

if (argc == 2 && !strcmp(argv[1], "-h"))
usage_with_options(builtin_backfill_usage, options);

argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
0);

git_config(git_default_config, NULL);

if (ctx.sparse < 0)
ctx.sparse = core_apply_sparse_checkout;

return do_backfill(&ctx);
}
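
The documented default ("`--sparse` is assumed when sparse-checkout is enabled, and can be disabled with `--no-sparse`") needs a third "not specified" state in addition to on and off. A minimal sketch of that resolution follows, assuming a negative value marks "not given on the command line". `SPARSE_UNSET` and `resolve_sparse` are hypothetical names, not part of the patch.

```c
/* -1 means "not given on the command line"; 0 and 1 are explicit. */
enum { SPARSE_UNSET = -1 };

/*
 * Resolve the effective setting: an explicit --sparse or --no-sparse
 * wins, otherwise fall back to whether sparse-checkout is enabled.
 */
int resolve_sparse(int cli_value, int sparse_checkout_enabled)
{
	if (cli_value < 0)
		return sparse_checkout_enabled ? 1 : 0;
	return cli_value ? 1 : 0;
}
```

The fallback only triggers if the option variable starts out negative, because a boolean option parser leaves the variable untouched when the flag is absent.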
1 change: 1 addition & 0 deletions command-list.txt
@@ -60,6 +60,7 @@ git-annotate ancillaryinterrogators
git-apply plumbingmanipulators complete
git-archimport foreignscminterface
git-archive mainporcelain
git-backfill mainporcelain history
git-bisect mainporcelain info
git-blame ancillaryinterrogators complete
git-branch mainporcelain history
1 change: 1 addition & 0 deletions git.c
@@ -502,6 +502,7 @@ static struct cmd_struct commands[] = {
{ "annotate", cmd_annotate, RUN_SETUP },
{ "apply", cmd_apply, RUN_SETUP_GENTLY },
{ "archive", cmd_archive, RUN_SETUP_GENTLY },
{ "backfill", cmd_backfill, RUN_SETUP },
{ "bisect", cmd_bisect, RUN_SETUP },
{ "blame", cmd_blame, RUN_SETUP },
{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },