Add path walk API and its use in 'git pack-objects' #5171

Merged
merged 14 commits into from
Sep 25, 2024

Conversation

derrickstolee

This is a follow-up to #5157 and is also motivated by the RFC in gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a single commit and then expanding the trees and blobs reachable from that commit that have not been visited yet. This means that objects arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches according to their type and path. This will walk all annotated tags, all commits, all root trees, and then start a depth-first search among all paths in the repo to collect trees and blobs in batches.
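To make the batching idea concrete, here is a minimal sketch in plain C. It is not the path-walk API: every name (`toy_entry`, `toy_walk_by_path`, `toy_count`) is hypothetical, object IDs are plain integers, and paths are grouped in first-seen order rather than by the real depth-first search. It only illustrates two properties of the design: objects reach the callback in per-path batches, and an object is reported only at the first path where it is discovered.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX 64

/* Hypothetical stand-ins: integers play the role of object IDs. */
enum toy_type { TOY_TREE, TOY_BLOB };

struct toy_entry {
	const char *path;	/* "" for a root tree */
	int id;
	enum toy_type type;
};

struct toy_counts {
	size_t tree_nr;
	size_t blob_nr;
};

typedef void (*toy_path_fn)(const char *path, const int *ids, size_t nr,
			    enum toy_type type, void *data);

/* Example callback: print each object in the batch and count it by type. */
static void toy_count(const char *path, const int *ids, size_t nr,
		      enum toy_type type, void *data)
{
	struct toy_counts *counts = data;

	for (size_t i = 0; i < nr; i++)
		printf("%s:%s:%d\n", type == TOY_TREE ? "TREE" : "BLOB",
		       path, ids[i]);
	if (type == TOY_TREE)
		counts->tree_nr += nr;
	else
		counts->blob_nr += nr;
}

/*
 * Call 'fn' once per (path, type) group, passing only the IDs that have
 * not been reported at any earlier path. Returns the number of batches.
 */
static size_t toy_walk_by_path(const struct toy_entry *entries, size_t nr,
			       toy_path_fn fn, void *data)
{
	int seen[TOY_MAX];		/* IDs already handed to 'fn' */
	char done[TOY_MAX] = { 0 };	/* entries already grouped */
	size_t seen_nr = 0, batches = 0;

	assert(nr <= TOY_MAX);
	for (size_t i = 0; i < nr; i++) {
		int batch[TOY_MAX];
		size_t batch_nr = 0;

		if (done[i])
			continue;
		for (size_t j = i; j < nr; j++) {
			size_t k;

			if (done[j] || entries[j].type != entries[i].type ||
			    strcmp(entries[j].path, entries[i].path))
				continue;
			done[j] = 1;
			for (k = 0; k < seen_nr; k++)
				if (seen[k] == entries[j].id)
					break;
			if (k < seen_nr)
				continue;	/* reported at an earlier path */
			seen[seen_nr++] = entries[j].id;
			batch[batch_nr++] = entries[j].id;
		}
		if (batch_nr) {
			fn(entries[i].path, batch, batch_nr,
			   entries[i].type, data);
			batches++;
		}
	}
	return batches;
}
```

Feeding this two toy commits that share `README` and differ in `src/main.c` yields one batch of root trees, one batch for `README`, and one batch holding both versions of `src/main.c`, which is the grouping a delta search wants.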

The most important application for this is being fast-tracked to Git for Windows: git pack-objects --path-walk. This application of the path walk API discovers the objects to pack via this batched walk, and automatically groups objects that appear at a common path so they can be checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions that sometimes occur with the new --full-name-hash option) and can be much faster to compute since the first pass of delta calculations does not waste time on objects that are unlikely to be diffable.
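For context on the collisions mentioned here, the sketch below is modeled on Git's default name-hash (`pack_name_hash()`). The body is written from memory and should be read as an assumption, and `sketch_name_hash` is a hypothetical name. Each step shifts the running value right by two bits and adds the next character in the top byte, so only the trailing characters of a path carry real weight.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of a name-hash in the style of Git's default one (assumed
 * from memory, not copied from the sources). Whitespace is skipped.
 */
static uint32_t sketch_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		/* Older characters fade out two bits at a time. */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

With ASCII input, characters more than sixteen positions from the end can change the result by at most one, so paths such as `a/components/Button/index.tsx` and `b/components/Button/index.tsx` land on the same or adjacent hash values and get sorted together even though they are different files.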

Some statistics are available in the commit messages.

Member

@dscho dscho left a comment

Wow, what an incredible amount of work!

I would like to work on a couple of rough edges before integrating this. No major work is needed, so we should be able to get this integrated today, in time for tomorrow's -rc0.

Documentation/technical/api-path-walk.txt (outdated, resolved)
path-walk.c (resolved)
t/helper/test-path-walk.c (outdated, resolved)
t/helper/test-path-walk.c (outdated, resolved)
t/t6601-path-walk.sh (resolved)
path-walk.c (resolved)
path-walk.c (outdated, resolved)
t/helper/test-path-walk.c (outdated, resolved)
builtin/pack-objects.c (outdated, resolved)
Documentation/git-repack.txt (resolved)
@dscho dscho added this to the Next release milestone Sep 25, 2024

In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

@derrickstolee
Author

Range-diff for latest push:
 1:  8b6ba9390c !  1:  bd4446ef3b path-walk: introduce an object walk by path
    @@ Documentation/technical/api-path-walk.txt (new)
     +paths is used to visit the object.
     +
     +When walking a range of commits with some `UNINTERESTING` objects, the
    -+objects with the `UNINTERESTING` flag are included in these batches.
    ++objects with the `UNINTERESTING` flag are included in these batches. In
    ++order to walk `UNINTERESTING` objects, the `--boundary` option must be
    ++used in the commit walk in order to visit `UNINTERESTING` commits.
     +
     +Basics
     +------
    @@ path-walk.c (new)
     +
     +/*
     + * For each path in paths_to_explore, walk the trees another level
    -+ * and add any found blobs to the batch (but only if they don't
    -+ * exist and haven't been added yet).
    ++ * and add any found blobs to the batch (but only if they exist and
    ++ * haven't been added yet).
     + */
     +static int walk_path(struct path_walk_context *ctx,
     +		     const char *path)
 2:  591438e6dc !  2:  4d1d9ae8df t6601: add helper for testing path-walk API
    @@ t/helper/test-path-walk.c (new)
     +#include "oid-array.h"
     +
     +struct path_walk_test_data {
    -+	uint32_t tree_nr;
    -+	uint32_t blob_nr;
    ++	size_t tree_nr;
    ++	size_t blob_nr;
     +};
     +
     +static int emit_block(const char *path, struct oid_array *oids,
    @@ t/helper/test-path-walk.c (new)
     +
     +	res = walk_objects_by_path(&info);
     +
    -+	printf("trees:%d\nblobs:%d\n",
    ++	printf("trees:%" PRIuMAX "\n"
    ++	       "blobs:%" PRIuMAX "\n",
     +	       data.tree_nr, data.blob_nr);
     +
     +	return res;
 3:  f2c5f20045 !  3:  9aad76f169 path-walk: allow consumer to specify object types
    @@ Metadata
      ## Commit message ##
         path-walk: allow consumer to specify object types
     
    -    This adds the ability to ask for the commits as a single list. This will
    -    also reduce the calls in 'git backfill' to be a BUG() statement if called
    -    with anything other than blobs.
    +    We add the ability to filter the object types in the path-walk API so
    +    the callback function is called fewer times.
    +
    +    This adds the ability to ask for the commits in a list, as well. Future
    +    changes will add the ability to visit annotated tags.
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
      		if (t)
      			oid_array_append(&root_tree_list->oids, oid);
      		else
    - 			warning("could not find tree %s", oid_to_hex(oid));
    -+
    - 	}
    - 
    +@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
      	trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
      	trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
      
    @@ t/helper/test-path-walk.c
      #include "oid-array.h"
      
      struct path_walk_test_data {
    -+	uint32_t commit_nr;
    - 	uint32_t tree_nr;
    - 	uint32_t blob_nr;
    ++	size_t commit_nr;
    + 	size_t tree_nr;
    + 	size_t blob_nr;
      };
     @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_array *oids,
      	const char *typestr;
    @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	res = walk_objects_by_path(&info);
      
    --	printf("trees:%d\nblobs:%d\n",
    +-	printf("trees:%" PRIuMAX "\n"
    ++	printf("commits:%" PRIuMAX "\n"
    ++	       "trees:%" PRIuMAX "\n"
    + 	       "blobs:%" PRIuMAX "\n",
     -	       data.tree_nr, data.blob_nr);
    -+	printf("commits:%d\ntrees:%d\nblobs:%d\n",
     +	       data.commit_nr, data.tree_nr, data.blob_nr);
      
      	return res;
 4:  7521576f8c !  4:  0851c92c55 path-walk: allow visiting tags
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
     +		info->path_fn("", &tags, OBJ_TAG, info->path_fn_data);
     +
     +		if (tagged_blob_list.nr && info->blobs)
    -+			info->path_fn("tagged-blobs", &tagged_blob_list, OBJ_BLOB,
    ++			info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB,
     +				      info->path_fn_data);
     +
     +		trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr);
    @@ path-walk.h: struct path_walk_info {
     
      ## t/helper/test-path-walk.c ##
     @@ t/helper/test-path-walk.c: struct path_walk_test_data {
    - 	uint32_t commit_nr;
    - 	uint32_t tree_nr;
    - 	uint32_t blob_nr;
    -+	uint32_t tag_nr;
    + 	size_t commit_nr;
    + 	size_t tree_nr;
    + 	size_t blob_nr;
    ++	size_t tag_nr;
      };
      
      static int emit_block(const char *path, struct oid_array *oids,
    @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
    - 	res = walk_objects_by_path(&info);
    - 
    --	printf("commits:%d\ntrees:%d\nblobs:%d\n",
    + 	printf("commits:%" PRIuMAX "\n"
    + 	       "trees:%" PRIuMAX "\n"
    +-	       "blobs:%" PRIuMAX "\n",
     -	       data.commit_nr, data.tree_nr, data.blob_nr);
    -+	printf("commits:%d\ntrees:%d\nblobs:%d\ntags:%d\n",
    ++	       "blobs:%" PRIuMAX "\n"
    ++	       "tags:%" PRIuMAX "\n",
     +	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
      
      	return res;
    @@ t/t6601-path-walk.sh: test_expect_success 'all' '
      	BLOB:right/c:$(git rev-parse topic:right/c)
      	BLOB:right/d:$(git rev-parse base~1:right/d)
     -	blobs:6
    -+	BLOB:tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
    -+	BLOB:tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
    ++	BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
    ++	BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
     +	BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
     +	blobs:10
     +	TAG::$(git rev-parse refs/tags/first)
 5:  d1af7aa423 =  5:  9ecc60d738 revision: create mark_trees_uninteresting_dense()
 6:  1571d7b177 !  6:  223b050ede path-walk: add prune_all_uninteresting option
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
     +		} else {
      			warning("could not find tree %s", oid_to_hex(oid));
     +		}
    - 
    ++
     +		if (t && (c->object.flags & UNINTERESTING)) {
     +			t->object.flags |= UNINTERESTING;
     +			has_uninteresting = 1;
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
     +	for (size_t i = 0; i < oids->nr; i++) {
     +		struct object *o = lookup_unknown_object(the_repository,
     +							 &oids->oid[i]);
    -+		printf("%s:%s:%s", typestr, path, oid_to_hex(&oids->oid[i]));
    -+
    -+		if (o->flags & UNINTERESTING)
    -+			printf(":UNINTERESTING");
    -+		printf("\n");
    ++		printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
    ++		       o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
     +	}
      
      	return 0;
 7:  e4d3374ec6 =  7:  787124f2ed pack-objects: extract should_attempt_deltas()
 8:  d10ebfd55a !  8:  775278f7b2 pack-objects: add --path-walk option
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
    + ## Documentation/git-pack-objects.txt ##
    +@@ Documentation/git-pack-objects.txt: SYNOPSIS
    + 	[--cruft] [--cruft-expiration=<time>]
    + 	[--stdout [--filter=<filter-spec>] | <base-name>]
    + 	[--shallow] [--keep-true-parents] [--[no-]sparse]
    +-	[--full-name-hash] < <object-list>
    ++	[--full-name-hash] [--path-walk] < <object-list>
    + 
    + 
    + DESCRIPTION
    +@@ Documentation/git-pack-objects.txt: raise an error.
    + 	Restrict delta matches based on "islands". See DELTA ISLANDS
    + 	below.
    + 
    ++--path-walk::
    ++	By default, `git pack-objects` walks objects in an order that
    ++	presents trees and blobs in an order unrelated to the path they
    ++	appear relative to a commit's root tree. The `--path-walk` option
    ++	enables a different walking algorithm that organizes trees and
    ++	blobs by path. This has the potential to improve delta compression
    ++	especially in the presence of filenames that cause collisions in
    ++	Git's default name-hash algorithm. Due to changing how the objects
    ++	are walked, this option is not compatible with `--delta-islands`,
    ++	`--shallow`, or `--filter`.
    + 
    + DELTA ISLANDS
    + -------------
    +
      ## Documentation/technical/api-path-walk.txt ##
     @@ Documentation/technical/api-path-walk.txt: Examples
      --------
    @@ builtin/pack-objects.c: static void mark_bitmap_preferred_tips(void)
     +static inline int is_oid_interesting(struct repository *repo,
     +				     struct object_id *oid)
     +{
    -+	struct object *o = lookup_unknown_object(repo, oid);
    ++	struct object *o = lookup_object(repo, oid);
     +	return o && !(o->flags & UNINTERESTING);
     +}
     +
 9:  095a10dbb0 =  9:  62a8358c73 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
10:  b854dfcc52 ! 10:  de63133baa repack: add --path-walk option
    @@ Documentation/git-repack.txt: SYNOPSIS
      
      DESCRIPTION
      -----------
    +@@ Documentation/git-repack.txt: linkgit:git-multi-pack-index[1]).
    + 	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
    + 	containing the non-redundant packs.
    + 
    ++--path-walk::
    ++	This option passes the `--path-walk` option to the underlying
    ++	`git pack-objects` process (see linkgit:git-pack-objects[1]).
    ++	By default, `git pack-objects` walks objects in an order that
    ++	presents trees and blobs in an order unrelated to the path they
    ++	appear relative to a commit's root tree. The `--path-walk` option
    ++	enables a different walking algorithm that organizes trees and
    ++	blobs by path. This has the potential to improve delta compression
    ++	especially in the presence of filenames that cause collisions in
    ++	Git's default name-hash algorithm. Due to changing how the objects
    ++	are walked, this option is not compatible with `--delta-islands`
    ++	or `--filter`.
    ++
    + CONFIGURATION
    + -------------
    + 
     
      ## builtin/repack.c ##
     @@ builtin/repack.c: static char *packdir, *packtmp_name, *packtmp;
11:  3d42b0adc6 ! 11:  2be08bb37b pack-objects: enable --path-walk via config
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
    + ## Documentation/config/feature.txt ##
    +@@ Documentation/config/feature.txt: walking fewer objects.
    + +
    + * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
    + reusing objects from multiple packs instead of just one.
    +++
    ++* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
    ++significantly smaller in the presence of certain filename collisions with Git's
    ++default name-hash.
    + 
    + feature.manyFiles::
    + 	Enable config options that optimize for repos with many files in the
    +
      ## Documentation/config/pack.txt ##
     @@ Documentation/config/pack.txt: pack.useSparse::
      	commits contain certain types of direct renames. Default is
12:  af3c37b26b = 12:  e0a1e66426 scalar: enable path-walk during push via config
13:  8084dd2024 = 13:  0238b0e4d0 pack-objects: refactor path-walk delta phase
14:  e91c3b9394 = 14:  946bd8f35c pack-objects: thread the path-based compression

@derrickstolee
Author

My previous push forgot to add the parse-opts to test-tool path-walk, so I've done so in this push.

Range-diff since last push
 1:  bd4446ef3b =  1:  bd4446ef3b path-walk: introduce an object walk by path
 2:  4d1d9ae8df !  2:  30c651a8c5 t6601: add helper for testing path-walk API
    @@ t/helper/test-path-walk.c (new)
     +#include "pretty.h"
     +#include "revision.h"
     +#include "setup.h"
    ++#include "parse-options.h"
     +#include "path-walk.h"
     +#include "oid-array.h"
     +
    ++static const char * const path_walk_usage[] = {
    ++	N_("test-tool path-walk <options> -- <revision-options>"),
    ++	NULL
    ++};
    ++
     +struct path_walk_test_data {
     +	size_t tree_nr;
     +	size_t blob_nr;
    @@ t/helper/test-path-walk.c (new)
     +
     +int cmd__path_walk(int argc, const char **argv)
     +{
    -+	int argi, res;
    ++	int res;
     +	struct rev_info revs = REV_INFO_INIT;
     +	struct path_walk_info info = PATH_WALK_INFO_INIT;
     +	struct path_walk_test_data data = { 0 };
    ++	struct option options[] = {
    ++		OPT_END(),
    ++	};
     +
     +	initialize_repository(the_repository);
     +	setup_git_directory();
     +	revs.repo = the_repository;
     +
    -+	for (argi = 0; argi < argc; argi++) {
    -+		if (!strcmp(argv[argi], "--"))
    -+			break;
    -+	}
    ++	argc = parse_options(argc, argv, NULL,
    ++			     options, path_walk_usage,
    ++			     PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
     +
    -+	if (argi < argc)
    -+		setup_revisions(argc - argi, argv + argi, &revs, NULL);
    ++	if (argc > 1)
    ++		setup_revisions(argc, argv, &revs, NULL);
     +	else
    -+		die("usage: test-tool path-walk <options> -- <rev opts>");
    ++		usage(path_walk_usage[0]);
     +
     +	info.revs = &revs;
     +	info.path_fn = emit_block;
 3:  9aad76f169 !  3:  a0d51fb05d path-walk: allow consumer to specify object types
    @@ path-walk.h: struct path_walk_info {
       * Given the configuration of 'info', walk the commits based on 'info->revs' and
     
      ## t/helper/test-path-walk.c ##
    -@@
    - #include "oid-array.h"
    +@@ t/helper/test-path-walk.c: static const char * const path_walk_usage[] = {
    + };
      
      struct path_walk_test_data {
     +	size_t commit_nr;
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      		typestr = "TREE";
      		tdata->tree_nr += oids->nr;
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 	revs.repo = the_repository;
    + 	struct path_walk_info info = PATH_WALK_INFO_INIT;
    + 	struct path_walk_test_data data = { 0 };
    + 	struct option options[] = {
    ++		OPT_BOOL(0, "blobs", &info.blobs,
    ++			 N_("toggle inclusion of blob objects")),
    ++		OPT_BOOL(0, "commits", &info.commits,
    ++			 N_("toggle inclusion of commit objects")),
    ++		OPT_BOOL(0, "trees", &info.trees,
    ++			 N_("toggle inclusion of tree objects")),
    + 		OPT_END(),
    + 	};
      
    - 	for (argi = 0; argi < argc; argi++) {
    -+		if (!strcmp(argv[argi], "--no-blobs"))
    -+			info.blobs = 0;
    -+		if (!strcmp(argv[argi], "--no-trees"))
    -+			info.trees = 0;
    -+		if (!strcmp(argv[argi], "--no-commits"))
    -+			info.commits = 0;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	res = walk_objects_by_path(&info);
 4:  0851c92c55 !  4:  fd1addc05c path-walk: allow visiting tags
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      		BUG("we do not understand this type");
      	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 			info.trees = 0;
    - 		if (!strcmp(argv[argi], "--no-commits"))
    - 			info.commits = 0;
    -+		if (!strcmp(argv[argi], "--no-tags"))
    -+			info.tags = 0;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
    + 			 N_("toggle inclusion of blob objects")),
    + 		OPT_BOOL(0, "commits", &info.commits,
    + 			 N_("toggle inclusion of commit objects")),
    ++		OPT_BOOL(0, "tags", &info.tags,
    ++			 N_("toggle inclusion of tag objects")),
    + 		OPT_BOOL(0, "trees", &info.trees,
    + 			 N_("toggle inclusion of tree objects")),
    + 		OPT_END(),
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	printf("commits:%" PRIuMAX "\n"
 5:  9ecc60d738 =  5:  5de5815c92 revision: create mark_trees_uninteresting_dense()
 6:  223b050ede !  6:  d835e3b218 path-walk: add prune_all_uninteresting option
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      	return 0;
      }
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 			info.commits = 0;
    - 		if (!strcmp(argv[argi], "--no-tags"))
    - 			info.tags = 0;
    -+		if (!strcmp(argv[argi], "--prune"))
    -+			info.prune_all_uninteresting = 1;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
    + 			 N_("toggle inclusion of tag objects")),
    + 		OPT_BOOL(0, "trees", &info.trees,
    + 			 N_("toggle inclusion of tree objects")),
    ++		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
    ++			 N_("toggle pruning of uninteresting paths")),
    + 		OPT_END(),
    + 	};
    + 
     
      ## t/t6601-path-walk.sh ##
     @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
 7:  787124f2ed =  7:  48761b8008 pack-objects: extract should_attempt_deltas()
 8:  775278f7b2 =  8:  fd41482cbc pack-objects: add --path-walk option
 9:  62a8358c73 =  9:  c82afd67e3 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
10:  de63133baa = 10:  bfa052fcf1 repack: add --path-walk option
11:  2be08bb37b = 11:  6d22afb7c7 pack-objects: enable --path-walk via config
12:  e0a1e66426 = 12:  54de70e074 scalar: enable path-walk during push via config
13:  0238b0e4d0 = 13:  213f9bd986 pack-objects: refactor path-walk delta phase
14:  946bd8f35c = 14:  4b68c8e6ad pack-objects: thread the path-based compression

derrickstolee and others added 2 commits September 25, 2024 11:26

Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. Future
changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

@derrickstolee
Author

32-bit Linux complained about size_t and PRIuMAX so I changed them to uintmax_t in the test helper.
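A minimal sketch of the portable pattern, not the test helper itself: `PRIuMAX` expands to the conversion specifier for `uintmax_t`, so the matching argument has to be exactly that type. On typical 32-bit targets a `size_t` is narrower than `uintmax_t`, which is what the compiler flags. Either store the counters as `uintmax_t` (as described above) or cast at the call site. The names `counts_sketch`, `format_counts`, and `format_size` are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

struct counts_sketch {
	uintmax_t tree_nr;
	uintmax_t blob_nr;
};

/* Counters already have type uintmax_t, so no cast is needed. */
static int format_counts(char *buf, size_t len,
			 const struct counts_sketch *c)
{
	return snprintf(buf, len,
			"trees:%" PRIuMAX "\n"
			"blobs:%" PRIuMAX "\n",
			c->tree_nr, c->blob_nr);
}

/* A size_t value must be cast explicitly before it meets PRIuMAX. */
static int format_size(char *buf, size_t len, size_t n)
{
	return snprintf(buf, len, "%" PRIuMAX, (uintmax_t)n);
}
```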

@dscho
Member

dscho commented Sep 25, 2024

> 32-bit Linux complained about size_t and PRIuMAX so I changed them to uintmax_t in the test helper.

Good point!

Member

@dscho dscho left a comment

Thank you for indulging me, all of my concerns are addressed!

In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

The sparse tree walk algorithm was created in d5d2e93 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.
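The difference between the short-circuiting and the dense behavior can be sketched on a toy tree structure. This is an illustration under assumptions, not the code from revision.c: `toy_tree`, `toy_mark_tree`, and `toy_mark_trees_dense` are hypothetical names, and child pointers replace real tree parsing.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNINTERESTING 1u
#define TOY_MAX_CHILDREN 4

struct toy_tree {
	unsigned flags;
	struct toy_tree *children[TOY_MAX_CHILDREN];
	size_t child_nr;
};

static void toy_mark_tree(struct toy_tree *t);

/* Mark everything below 't', regardless of the flag on 't' itself. */
static void toy_mark_contents(struct toy_tree *t)
{
	for (size_t i = 0; i < t->child_nr; i++)
		toy_mark_tree(t->children[i]);
}

/* Short-circuits: a tree that already carries the flag is left alone. */
static void toy_mark_tree(struct toy_tree *t)
{
	if (!t || (t->flags & TOY_UNINTERESTING))
		return;
	t->flags |= TOY_UNINTERESTING;
	toy_mark_contents(t);
}

/*
 * Dense variant: for every tree in the set that is already flagged,
 * push the flag into its contents. Unflagged trees are not touched.
 */
static void toy_mark_trees_dense(struct toy_tree **trees, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (trees[i] && (trees[i]->flags & TOY_UNINTERESTING))
			toy_mark_contents(trees[i]);
}
```

Calling `toy_mark_tree()` on a root that is already flagged does nothing, so its children stay unflagged. The dense variant starts from exactly those flagged roots and descends, stopping only at subtrees that were marked earlier.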

Signed-off-by: Derrick Stolee <stolee@gmail.com>

This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERESTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do not continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.
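The pruning rule can be sketched as a small recursive walk. This is a toy model under assumptions: `pr_path`, `pr_all_uninteresting`, and `pr_walk` are hypothetical names, each path carries only the flags of its objects, and the walk simply counts the paths it reaches. It is meant to show the decision, not the real traversal.

```c
#include <assert.h>
#include <stddef.h>

#define PR_UNINTERESTING 1u
#define PR_MAX 4

struct pr_path {
	const char *name;
	unsigned obj_flags[PR_MAX];	/* flags of objects seen at this path */
	size_t obj_nr;
	struct pr_path *children[PR_MAX];
	size_t child_nr;
};

/* True only if the path has objects and every one is UNINTERESTING. */
static int pr_all_uninteresting(const struct pr_path *p)
{
	for (size_t i = 0; i < p->obj_nr; i++)
		if (!(p->obj_flags[i] & PR_UNINTERESTING))
			return 0;
	return p->obj_nr > 0;
}

/* Returns the number of paths reached, including 'p' itself. */
static size_t pr_walk(const struct pr_path *p, int prune)
{
	size_t reached = 1;

	if (prune && pr_all_uninteresting(p))
		return reached;		/* do not descend into children */
	for (size_t i = 0; i < p->child_nr; i++)
		reached += pr_walk(p->children[i], prune);
	return reached;
}
```

A root with mixed objects is always expanded. A directory whose objects are all UNINTERESTING is still reached, but nothing beneath it is explored when pruning is on.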

Signed-off-by: Derrick Stolee <stolee@gmail.com>

This will be helpful in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <stolee@gmail.com>

There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.
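A rough sketch of how such a test variable can supply a default while an explicit command-line choice still wins. This is an assumption about the shape of the logic, not the code in builtin/pack-objects.c: the helper names are hypothetical, the boolean parser is a simplified stand-in for Git's own, and `-1` is used here to mean "no option given".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified boolean parser: NULL means "unset", so use the default. */
static int sketch_parse_bool(const char *value, int def)
{
	static const char *const truthy[] = { "1", "true", "yes", "on" };

	if (!value)
		return def;
	for (size_t i = 0; i < sizeof(truthy) / sizeof(truthy[0]); i++)
		if (!strcmp(value, truthy[i]))
			return 1;
	return 0;
}

/*
 * cli_value: -1 if neither --path-walk nor --no-path-walk was given,
 * otherwise 0 or 1. The environment only matters when it is -1.
 */
static int sketch_resolve_path_walk(int cli_value, const char *env_value)
{
	if (cli_value >= 0)
		return cli_value;
	return sketch_parse_bool(env_value, 0);
}

/* Convenience wrapper that reads the real environment. */
static int sketch_path_walk_default(int cli_value)
{
	return sketch_resolve_path_walk(cli_value,
					getenv("GIT_TEST_PACK_PATH_WALK"));
}
```

This ordering is what lets an individual test pass `--no-path-walk` and stay unaffected even when the variable is set for the whole test run.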

This was useful in testing the --path-walk implementation, especially
in conjunction with tests such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object is too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.40(0.47+0.04)
5313.3: thin pack size                                    1.2M
5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash             22.8K
5313.6: thin pack with --path-walk             0.08(0.06+0.02)
5313.7: thin pack size with --path-walk                  20.8K
5313.8: big pack                               2.16(8.43+0.23)
5313.9: big pack size                                    17.7M
5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash             18.0M
5313.12: big pack with --path-walk             2.21(8.39+0.24)
5313.13: big pack size with --path-walk                  17.8M
5313.14: repack                                98.05(662.37+2.64)
5313.15: repack size                                    449.1K
5313.16: repack with --full-name-hash          33.95(129.44+2.63)
5313.17: repack size with --full-name-hash              182.9K
5313.18: repack with --path-walk               106.21(121.58+0.82)
5313.19: repack size with --path-walk                   159.6K

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes in the thin packs:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.39(2.91+0.10)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.01(0.00+0.00)
5313.3: thin pack size                                     310
5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash              1.4K
5313.6: thin pack with --path-walk             0.00(0.00+0.00)
5313.7: thin pack size with --path-walk                    310

Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so it would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".
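As a rough illustration of the precedence described above, the following
Python sketch models how an effective value for "pack.usePathWalk" could be
derived. It assumes that an explicit setting wins over the defaults implied
by the feature.* options; the function name and the dict-based config are
hypothetical and are not part of Git.

```python
def resolve_use_path_walk(config):
    """Return the effective pack.usePathWalk value for a config mapping.

    `config` is a hypothetical dict mapping lowercase config keys to
    booleans. Assumption: an explicit pack.usePathWalk setting overrides
    the defaults implied by feature.experimental and feature.manyFiles.
    """
    explicit = config.get("pack.usepathwalk")
    if explicit is not None:
        return explicit
    # No explicit value: either feature flag turns the option on.
    return bool(config.get("feature.experimental")
                or config.get("feature.manyfiles"))
```

With this model, a repository that sets only "feature.manyFiles" gets the
path walk, while one that also sets "pack.usePathWalk" to false does not.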

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it was taking a long time to enumerate objects and
then computing deltas very quickly.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.
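The two-phase idea above can be sketched in Python as follows. This is a
simplified model under stated assumptions, not the C implementation: the
Region and PackList classes and the find_deltas callback are hypothetical
names, and a region is just a (start, count) window into the object list.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    start: int  # index of the batch's first object in the pack list
    count: int  # number of objects that share the batch's path


@dataclass
class PackList:
    objects: list = field(default_factory=list)
    regions: list = field(default_factory=list)

    def add_path_batch(self, oids):
        """Phase 1: append one path's objects and record where they landed."""
        if oids:
            self.regions.append(Region(len(self.objects), len(oids)))
            self.objects.extend(oids)


def compress_regions(pack, find_deltas):
    """Phase 2: run the per-path delta search over every recorded region."""
    for region in pack.regions:
        find_deltas(pack.objects[region.start:region.start + region.count])
```

Because enumeration only records regions, the delta work can be reported
under its own progress indicator once the walk has finished.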

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but that could
be done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending a batch based on
name-hash, which is how the existing code tries to group sections of the
object entry array. We re-use the 'list_size' and 'remaining' members to
let a thread that finishes its batch early borrow pending work from
other "victim" threads.
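The splitting and borrowing scheme can be sketched in Python as below. It
is a single-threaded model for illustration only: the Worker class, the
function names, and the rule that a victim must have at least two regions
left are assumptions, not the thresholds used by 'git pack-objects'.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    start: int      # index of the first region this worker still owns
    remaining: int  # number of regions it has left to process


def split_regions(num_regions, num_threads):
    """Give each worker a contiguous, roughly equal range of regions."""
    workers = []
    start = 0
    for i in range(num_threads):
        size = (num_regions - start) // (num_threads - i)
        workers.append(Worker(start, size))
        start += size
    return workers


def steal_work(workers, idle):
    """Hand the idle worker the tail half of the busiest worker's range."""
    others = [w for w in workers if w is not idle]
    victim = max(others, key=lambda w: w.remaining, default=None)
    if victim is None or victim.remaining < 2:
        return False
    take = victim.remaining // 2
    victim.remaining -= take
    idle.start = victim.start + victim.remaining
    idle.remaining = take
    return True
```

Taking work from the tail of the victim's range means the victim can keep
processing from its current position without coordination beyond updating
its 'remaining' count.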

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                    HEAD~1    HEAD
-------------------------------------------------------------
5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
5313.7: thin pack size with --path-walk    475     475  +0.0%
5313.12: big pack with --path-walk        1.99    1.87  -6.0%
5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
5313.18: repack with --path-walk         98.14   41.46 -57.8%
5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee derrickstolee merged commit 275c9db into git-for-windows:main Sep 25, 2024
44 checks passed
git-for-windows-ci pushed a commit that referenced this pull request Sep 25, 2024
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.
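To make the batching concrete, here is a minimal Python sketch of the
depth-first part of such a walk. It is a toy model under stated
assumptions, not the path walk API itself: trees are hypothetical
`(oid, entries)` tuples, blobs are plain object-id strings, and tags and
commits are left out.

```python
def path_walk(root_trees):
    """Return (path, type, oids) batches, one batch per path."""
    batches = []
    seen = set()
    stack = [("", root_trees)]
    while stack:
        path, trees = stack.pop()
        tree_oids, child_trees, child_blobs = [], {}, {}
        for oid, entries in trees:
            if oid in seen:
                continue  # each object is reported only once
            seen.add(oid)
            tree_oids.append(oid)
            for name, child in entries.items():
                if isinstance(child, tuple):
                    child_trees.setdefault(name, []).append(child)
                elif child not in seen:
                    seen.add(child)
                    child_blobs.setdefault(name, []).append(child)
        if tree_oids:
            batches.append((path or "/", "tree", tree_oids))
        for name in sorted(child_blobs):
            batches.append((path + name, "blob", child_blobs[name]))
        # Push subdirectories so they are explored depth-first.
        for name in sorted(child_trees, reverse=True):
            stack.append((path + name + "/", child_trees[name]))
    return batches
```

For two root trees that each contain `README` and `src/main.c`, the sketch
yields four batches: the root trees, the `README` blobs, the `src/` trees,
and the `src/main.c` blobs, so every version of a path arrives together.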

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration for this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.
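As a sketch of the kind of per-path aggregation this enables, the Python
snippet below sums sizes over all versions seen at each path and ranks the
paths by on-disk size. The function name and the `(path, disk_size,
inflated_size)` input tuples are hypothetical; `git survey` itself is
implemented differently.

```python
from collections import defaultdict


def top_paths_by_disk_size(objects, limit=10):
    """Rank paths by total on-disk size across all versions at that path.

    `objects` is an iterable of (path, disk_size, inflated_size) tuples,
    one per object version. Returns rows of
    (path, count, disk_size, inflated_size).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # count, disk, inflated
    for path, disk_size, inflated_size in objects:
        entry = totals[path]
        entry[0] += 1
        entry[1] += disk_size
        entry[2] += inflated_size
    rows = [(path, *values) for path, values in totals.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]
```

Because the walk delivers all versions of a path in one batch, totals like
these can be accumulated per batch rather than through a global map.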

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
dscho pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho pushed a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```
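
One way to read the table is to compare the two size columns for each path. The Python sketch below parses a few rows copied from the output above and prints the inflated-to-disk ratio. The `parse_row` helper is hypothetical, and the ratio is only a rough indicator of how well the versions at a path compress, not a statistic that `git survey` itself reports.

```python
# A few rows copied from the table above (path | count | disk | inflated).
ROWS = """\
whats-cooking.txt |  1373 |  11637459 |      37226854
git-rebase--helper |     1 |   6027849 |      15269664
compat/mingw.c |  6111 |   5194453 |     463466970
po/de.po |   210 |   1360112 |      71198603
"""


def parse_row(line):
    """Split one table row into (path, count, disk_size, inflated_size)."""
    path, count, disk, inflated = (field.strip() for field in line.split("|"))
    return path, int(count), int(disk), int(inflated)


for line in ROWS.splitlines():
    path, count, disk, inflated = parse_row(line)
    print(f"{path}: {count} versions, inflated/disk = {inflated / disk:.1f}")
```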

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this means that @jeffhostetler's version will need to be
pulled out in the next `microsoft/git` rebase, since the two versions
conflict in enough places. These conflicts include how tables are stored
and generated, as the version in this PR is slightly more general to
allow for different kinds of data.
dscho pushed a commit that referenced this pull request Sep 27, 2024
dscho added a commit that referenced this pull request Sep 27, 2024