Add experimental 'git survey' builtin #5174

derrickstolee · 2024-09-26T11:30:44Z

This introduces git survey to Git for Windows ahead of upstream, for the express purpose of getting the path-based analysis into the hands of more folks.

The inspiration for this builtin is git-sizer, but since that command relies on git cat-file --batch to get the contents of objects, it is limited in how much information it can provide.

This is mostly a rewrite of the git survey builtin that was introduced into the microsoft/git fork in microsoft#667. That version had a lot more bells and whistles, including an analysis much closer to what git-sizer provides.

The biggest difference is that this version focuses on using the path-walk API to visit batches of objects that share a common path. This makes it possible to identify, for instance, the path that contributes the most to the on-disk size across all versions at that path.

For example, here are the top ten paths contributing to my local Git repository (which includes microsoft/git and gitster/git):

TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
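
A minimal sketch of how a bounded "top N" list like the one above could be maintained while paths are visited. The names here (path_size, top_list, top_list_insert, TOP_MAX) are hypothetical and not taken from builtin/survey.c, and the path strings are borrowed rather than copied to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-path totals; not the struct used by git survey. */
struct path_size {
	const char *path;   /* borrowed pointer, not owned by the list */
	uint64_t disk_size; /* bytes on disk across all versions */
};

#define TOP_MAX 10

struct top_list {
	struct path_size items[TOP_MAX]; /* sorted, largest first */
	size_t nr;
};

/*
 * Insert 'entry' if it ranks among the TOP_MAX largest seen so far.
 * Returns 1 if the entry was stored, 0 if it was too small.
 */
int top_list_insert(struct top_list *top, const struct path_size *entry)
{
	size_t pos = top->nr;

	/* Walk up from the bottom while the new entry is strictly larger. */
	while (pos > 0 && top->items[pos - 1].disk_size < entry->disk_size)
		pos--;

	if (pos >= TOP_MAX)
		return 0;

	/* Grow the list, or let the last item fall off when it is full. */
	if (top->nr < TOP_MAX)
		top->nr++;

	for (size_t i = top->nr - 1; i > pos; i--)
		top->items[i] = top->items[i - 1];

	top->items[pos] = *entry;
	return 1;
}
```

Calling top_list_insert() once per visited path keeps memory use constant regardless of how many paths the repository contains, at the cost of a short linear shift per accepted entry.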

This kind of analysis has been helpful in identifying the reasons for growth in a few internal monorepos. Those findings motivated the changes in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of the experimental tool in advance of it being contributed to the upstream project.

Unfortunately, this will mean that in the next microsoft/git rebase, @jeffhostetler's version will need to be pulled out since there are enough conflicts. These conflicts include how tables are stored and generated, as the version in this PR is slightly more general to allow for different kinds of data.

derrickstolee · 2024-09-26T11:31:39Z

@dscho: The whitespace break is intentional due to the way that the output is generated (there are 8 or more spaces in the left padding of the output). Is there a way to ignore that failure?

dscho · 2024-09-26T13:44:04Z

> The whitespace break is intentional due to the way that the output is generated (there are 8 or more spaces in the left padding of the output). Is there a way to ignore that failure?

@derrickstolee yes, there are a couple of ways. 0071e3c works, even if it is ugly.

An alternative would have been to pull out the big hammer:

diff --git a/t/.gitattributes b/t/.gitattributes
index 7664c6e027d..7e9356613b5 100644
--- a/t/.gitattributes
+++ b/t/.gitattributes
@@ -23,3 +23,4 @@ t[0-9][0-9][0-9][0-9]/* -whitespace
 /t8005/*.txt eol=lf
 /t9*/*.dump eol=lf
 /t0040*.sh whitespace=-indent-with-non-tab
+/t8100*.sh whitespace=-indent-with-spaces

I do not like this because it affects the entire file, not just those two lines where we need an exemption.

dscho · 2024-09-26T14:10:43Z

> This is mostly a rewrite of the git survey builtin that was introduced into the microsoft/git fork in microsoft#667. That version had a lot more bells and whistles, including an analysis much closer to what git-sizer provides.
>
> [...]
>
> Unfortunately, this will mean that in the next microsoft/git rebase, @jeffhostetler's version will need to be pulled out since there are enough conflicts. These conflicts include how tables are stored and generated, as the version in this PR is slightly more general to allow for different kinds of data.

I hope to be able to remedy that at least to some extent, as one of the biggest benefits of having @jeffhostetler's version of git survey was that it provided comparable functionality to git sizer albeit without the multi-megabyte penalty (which is the primary reason why git sizer is not distributed alongside Git for Windows). It might take until -rc1 for me to find the time, though.

dscho

🎉

I'd like to squash in the fixup!s, but any other suggestion I had is not really important enough to hold this PR up.

dscho · 2024-09-26T14:46:12Z

builtin/survey.c

+	static struct strbuf lines = STRBUF_INIT;
+	size_t width = 0;
+	strbuf_setlen(&lines, 0);
+
+	strbuf_addch(&lines, ' ');
+	strbuf_addstr(&lines, name);
+	strbuf_addch(&lines, '\n');
+
+	for (size_t i = 0; i < nr; i++) {
+		if (i)
+			width += 3;
+		width += widths[i];
+	}
+	strbuf_addchars(&lines, '=', width);
+	printf("%s\n", lines.buf);


I feel like this can be simplified dramatically, but it may not be worth spending too many cycles on this. But if you like this, too, @derrickstolee, I would love to make it so:

const char *e = "======================================================================"
	"==============================================================================";
size_t width = (nr - 1) * 3;

for (size_t i = 0; i < nr; i++)
	width += widths[i];
printf(" %s\n%.*s\n", name, (int)width, e);

Granted, it does not scale to insanely long widths, but it is so much simpler.
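
For readers unfamiliar with the "%.*s" trick used above: the precision argument caps how many bytes of the string are printed, so one fixed string of '=' characters can serve as an underline of any width up to its own length. A small sketch, with a hypothetical helper name (format_underline) and an arbitrarily chosen 40-character rule, that also clamps the width so the scaling limitation stays harmless:

```c
#include <stdio.h>

/* 40 '=' characters; requested widths beyond this are clamped. */
static const char rule[] = "=========="
			   "=========="
			   "=========="
			   "==========";

/*
 * Write 'title' followed by an underline of 'width' '=' characters
 * into 'buf'. Returns whatever snprintf() returns.
 */
int format_underline(char *buf, size_t buflen, const char *title, size_t width)
{
	size_t max = sizeof(rule) - 1;

	if (width > max)
		width = max;

	/* "%.*s" expects an int precision, hence the cast. */
	return snprintf(buf, buflen, "%s\n%.*s\n", title, (int)width, rule);
}
```

The clamp matters because a precision larger than the string is simply ignored, which would silently produce an underline shorter than requested rather than an error.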

On the other hand, I think this entire table printing business could do with a context that contains a strbuf scratch buffer with which to work. I guess it's not the most pressing issue right now to simplify this.

I do like the width = (nr - 1) * 3; for sure. If the printf magic works, then excellent.

This works, but it's kind of ugly. I think I'd like to leave this for later, after all:

diff --git a/builtin/survey.c b/builtin/survey.c
index 37296a5feb7..2c86d78d8a6 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -270,27 +270,16 @@ static void insert_table_rowv(struct survey_table *table, ...)
 
 static void print_table_title(const char *name, size_t *widths, size_t nr)
 {
-	static struct strbuf lines = STRBUF_INIT;
-	size_t width = 0;
-	size_t min_width;
-	strbuf_setlen(&lines, 0);
+	const char *e = "======================================================"
+		"=============================================================="
+		"=============================================================";
+	size_t width = (nr - 1) * 3, min_width = strlen(name);
 
-	strbuf_addch(&lines, '\n');
-	strbuf_addstr(&lines, name);
-	min_width = lines.len - 1;
-	strbuf_addch(&lines, '\n');
-
-	for (size_t i = 0; i < nr; i++) {
-		if (i)
-			width += 3;
+	for (size_t i = 0; i < nr; i++)
 		width += widths[i];
-	}
-
-	if (width < min_width)
-		width = min_width;
-	strbuf_addchars(&lines, '=', width);
-	printf("%s\n", lines.buf);
+	printf("\n%s\n%.*s\n", name,
+	       (int)(min_width > width ? min_width : width), e);
 }
 
 static void print_row_plaintext(struct strvec *row, size_t *widths)

dscho · 2024-09-26T14:51:42Z

builtin/survey.c

-	strbuf_addch(&lines, ' ');
+	strbuf_addch(&lines, '\n');


This hunk probably wants to be squashed into the preceding commit.

Start work on a new 'git survey' command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from Go to gain further insight into potential scaling problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>

By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/".

Add command line opts to let the user ask for all refs or a subset of them and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>

derrickstolee · 2024-09-26T16:28:32Z

Latest push should handle all of the review comments. Here's the range diff from @dscho's push with fixups:

Range diff

 1:  9751447967 =  1:  3a8cd93b06 survey: stub in new experimental 'git-survey' command
 2:  cd7bdb7046 !  2:  c08fa91a24 survey: add command line opts to select references
    @@ builtin/survey.c: static void survey_load_config(struct survey_context *ctx)
     +
     +	ctx->report.refs.refs_nr = ref_array.nr;
     +	for (size_t i = 0; i < ref_array.nr; i++) {
    -+		size_t size;
    ++		unsigned long size;
     +		struct ref_array_item *item = ref_array.items[i];
     +
     +		switch (item->kind) {
 3:  672c3e6bda !  3:  fe8544c4c8 survey: start pretty printing data in table form
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +	table->rows_nr++;
     +}
     +
    ++#define SECTION_SEGMENT "========================================"
    ++#define SECTION_SEGMENT_LEN 40
    ++const char *section_line = SECTION_SEGMENT
    ++			   SECTION_SEGMENT
    ++			   SECTION_SEGMENT
    ++			   SECTION_SEGMENT;
    ++const size_t section_len = 4 * SECTION_SEGMENT_LEN;
    ++
     +static void print_table_title(const char *name, size_t *widths, size_t nr)
     +{
    -+	static struct strbuf lines = STRBUF_INIT;
    -+	size_t width = 0;
    -+	strbuf_setlen(&lines, 0);
    -+
    -+	strbuf_addch(&lines, ' ');
    -+	strbuf_addstr(&lines, name);
    -+	strbuf_addch(&lines, '\n');
    ++	size_t width = 3 * (nr - 1);
     +
    -+	for (size_t i = 0; i < nr; i++) {
    -+		if (i)
    -+			width += 3;
    ++	for (size_t i = 0; i < nr; i++)
     +		width += widths[i];
    -+	}
    -+	strbuf_addchars(&lines, '=', width);
    -+	printf("%s\n", lines.buf);
    ++
    ++	if (width > section_len)
    ++		width = section_len;
    ++
    ++	printf("\n%s\n%.*s\n", name, (int)width, section_line);
     +}
     +
     +static void print_row_plaintext(struct strvec *row, size_t *widths)
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +
     +	for (size_t j = 0; j < table->rows_nr; j++)
     +		print_row_plaintext(&table->rows[j], column_widths);
    ++
    ++	free(column_widths);
     +}
     +
     +static void survey_report_plaintext_refs(struct survey_context *ctx)
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +	strvec_push(&table.header, _("Count"));
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_branches) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->branches_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->branches_nr);
     +		insert_table_rowv(&table, _("Branches"), fmt, NULL);
     +		free(fmt);
     +	}
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_remotes) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->remote_refs_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->remote_refs_nr);
     +		insert_table_rowv(&table, _("Remote refs"), fmt, NULL);
     +		free(fmt);
     +	}
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_tags) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->tags_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_nr);
     +		insert_table_rowv(&table, _("Tags (all)"), fmt, NULL);
     +		free(fmt);
    -+		fmt = xstrfmt("%"PRIuMAX"", refs->tags_annotated_nr);
    ++		fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_annotated_nr);
     +		insert_table_rowv(&table, _("Tags (annotated)"), fmt, NULL);
     +		free(fmt);
     +	}
    @@ t/t8100-git-survey.sh: test_expect_success 'create a semi-interesting repo' '
     -	test_line_count = 0 err
     +	test_line_count = 0 err &&
     +
    -+	cat >expect <<-EOF &&
    ++	tr , " " >expect <<-EOF &&
     +	GIT SURVEY for "$(pwd)"
     +	-----------------------------------------------------
    -+	 REFERENCES SUMMARY
    ++
    ++	REFERENCES SUMMARY
     +	========================
    -+		Ref Type | Count
    ++	,       Ref Type | Count
     +	-----------------+------
    -+		Branches |     1
    ++	,       Branches |     1
     +	     Remote refs |     0
     +	      Tags (all) |     0
     +	Tags (annotated) |     0
 4:  846c8a69f7 !  4:  9ab09921d5 survey: add object count summary
    @@ builtin/survey.c: struct survey_context {
      	strvec_clear(&ctx->refs);
      }
      
    -@@ builtin/survey.c: static void print_table_title(const char *name, size_t *widths, size_t nr)
    +@@ builtin/survey.c: const size_t section_len = 4 * SECTION_SEGMENT_LEN;
    + static void print_table_title(const char *name, size_t *widths, size_t nr)
      {
    - 	static struct strbuf lines = STRBUF_INIT;
    - 	size_t width = 0;
    -+	size_t min_width;
    - 	strbuf_setlen(&lines, 0);
    + 	size_t width = 3 * (nr - 1);
    ++	size_t min_width = strlen(name);
      
    --	strbuf_addch(&lines, ' ');
    -+	strbuf_addch(&lines, '\n');
    - 	strbuf_addstr(&lines, name);
    -+	min_width = lines.len - 1;
    - 	strbuf_addch(&lines, '\n');
    - 
    - 	for (size_t i = 0; i < nr; i++) {
    -@@ builtin/survey.c: static void print_table_title(const char *name, size_t *widths, size_t nr)
    - 			width += 3;
    + 	for (size_t i = 0; i < nr; i++)
      		width += widths[i];
    - 	}
    -+
    + 
     +	if (width < min_width)
     +		width = min_width;
     +
    - 	strbuf_addchars(&lines, '=', width);
    - 	printf("%s\n", lines.buf);
    - }
    + 	if (width > section_len)
    + 		width = section_len;
    + 
     @@ builtin/survey.c: static void survey_report_plaintext_refs(struct survey_context *ctx)
      	clear_table(&table);
      }
    @@ builtin/survey.c: static void survey_report_plaintext_refs(struct survey_context
     +	strvec_push(&table.header, _("Object Type"));
     +	strvec_push(&table.header, _("Count"));
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->tags_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->tags_nr);
     +	insert_table_rowv(&table, _("Tags"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->commits_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->commits_nr);
     +	insert_table_rowv(&table, _("Commits"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->trees_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->trees_nr);
     +	insert_table_rowv(&table, _("Trees"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->blobs_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->blobs_nr);
     +	insert_table_rowv(&table, _("Blobs"), fmt, NULL);
     +	free(fmt);
     +
    @@ builtin/survey.c: static void do_load_refs(struct survey_context *ctx,
     -	for (size_t i = 0; i < ref_array.nr; i++) {
     +	ctx->report.refs.refs_nr = ctx->ref_array.nr;
     +	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
    - 		size_t size;
    + 		unsigned long size;
     -		struct ref_array_item *item = ref_array.items[i];
     +		struct ref_array_item *item = ctx->ref_array.items[i];
      
    @@ builtin/survey.c: static void survey_phase_refs(struct survey_context *ctx)
      
      	trace2_region_leave("survey", "phase/refs", ctx->repo);
     +}
    - 
    --	ref_array_clear(&ref_array);
    ++
     +static void increment_object_counts(
     +		struct survey_report_object_summary *summary,
     +		enum object_type type,
    @@ builtin/survey.c: static void survey_phase_refs(struct survey_context *ctx)
     +	case OBJ_COMMIT:
     +		summary->commits_nr += nr;
     +		break;
    -+
    + 
    +-	ref_array_clear(&ref_array);
     +	case OBJ_TREE:
     +		summary->trees_nr += nr;
     +		break;
    @@ t/t8100-git-survey.sh: test_expect_success 'git survey -h shows experimental war
     +	git survey --all-refs >out 2>err &&
      	test_line_count = 0 err &&
      
    - 	cat >expect <<-EOF &&
    - 	GIT SURVEY for "$(pwd)"
    - 	-----------------------------------------------------
    --	 REFERENCES SUMMARY
    -+
    -+	REFERENCES SUMMARY
    - 	========================
    --		Ref Type | Count
    -+	        Ref Type | Count
    + 	tr , " " >expect <<-EOF &&
    +@@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
      	-----------------+------
    --		Branches |     1
    -+	        Branches |     1
    + 	,       Branches |     1
      	     Remote refs |     0
     -	      Tags (all) |     0
     -	Tags (annotated) |     0
 5:  7fc3fc8022 !  5:  f4fde444fd survey: summarize total sizes by object type
    @@ builtin/survey.c: static void survey_report_plaintext_reachable_object_summary(s
     +	struct survey_table table = SURVEY_TABLE_INIT;
     +	table.table_name = title;
     +
    -+	strvec_push(&table.header, xstrdup(categories));
    -+	strvec_push(&table.header, xstrdup(_("Count")));
    -+	strvec_push(&table.header, xstrdup(_("Disk Size")));
    -+	strvec_push(&table.header, xstrdup(_("Inflated Size")));
    ++	strvec_push(&table.header, categories);
    ++	strvec_push(&table.header, _("Count"));
    ++	strvec_push(&table.header, _("Disk Size"));
    ++	strvec_push(&table.header, _("Inflated Size"));
     +
     +	for (size_t i = 0; i < summary_nr; i++) {
    -+		insert_table_rowv(&table, xstrdup(summary[i].label),
    -+				  xstrfmt("%"PRIuMAX, summary[i].nr),
    -+				  xstrfmt("%"PRIuMAX, summary[i].disk_size),
    -+				  xstrfmt("%"PRIuMAX, summary[i].inflated_size),
    -+				  NULL);
    ++		char *label_str =  xstrdup(summary[i].label);
    ++		char *nr_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].nr);
    ++		char *disk_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].disk_size);
    ++		char *inflate_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].inflated_size);
    ++
    ++		insert_table_rowv(&table, label_str, nr_str,
    ++				  disk_str, inflate_str, NULL);
    ++
    ++		free(label_str);
    ++		free(nr_str);
    ++		free(disk_str);
    ++		free(inflate_str);
     +	}
     +
     +	print_table_plaintext(&table);
    @@ builtin/survey.c: static void survey_phase_objects(struct survey_context *ctx)
      	walk_objects_by_path(&info);
     
      ## t/t8100-git-survey.sh ##
    +@@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
    + 	git survey --all-refs >out 2>err &&
    + 	test_line_count = 0 err &&
    + 
    ++	test_oid_cache <<-EOF &&
    ++	commits_size_on_disk sha1:     1523
    ++	commits_size_on_disk sha256:     1811
    ++
    ++	commits_size sha1:         2153
    ++	commits_size sha256:         2609
    ++
    ++	trees_size_on_disk sha1:      495
    ++	trees_size_on_disk sha256:      635
    ++
    ++	trees_size sha1:         1706
    ++	trees_size sha256:         2366
    ++
    ++	tags_size sha1:          528
    ++	tags_size sha256:          624
    ++
    ++	tags_size_on_disk sha1:      510
    ++	tags_size_on_disk sha256:      569
    ++	EOF
    ++
    + 	tr , " " >expect <<-EOF &&
    + 	GIT SURVEY for "$(pwd)"
    + 	-----------------------------------------------------
     @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
      	    Commits |    10
      	      Trees |    10
    @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
     +	===============================================
     +	Object Type | Count | Disk Size | Inflated Size
     +	------------+-------+-----------+--------------
    -+	    Commits |    10 |      1523 |          2153
    -+	      Trees |    10 |       495 |          1706
    ++	    Commits |    10 | $(test_oid commits_size_on_disk) | $(test_oid commits_size)
    ++	      Trees |    10 | $(test_oid trees_size_on_disk) | $(test_oid trees_size)
     +	      Blobs |    10 |       191 |           101
    -+	       Tags |     4 |       510 |           528
    ++	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
      	EOF
      
      	test_cmp expect out
 6:  448892d211 !  6:  21f75ebcf2 survey: show progress during object walk
    @@ t/t8100-git-survey.sh: test_expect_success 'create a semi-interesting repo' '
      '
      
     +test_expect_success 'git survey --progress' '
    -+	GIT_PROGRESS_DELATY=0 git survey --all-refs --progress >out 2>err &&
    ++	GIT_PROGRESS_DELAY=0 git survey --all-refs --progress >out 2>err &&
     +	grep "Preparing object walk" err
     +'
     +
 7:  d886c14550 !  7:  b6dd26b629 survey: add ability to track prioritized lists
    @@ builtin/survey.c: struct survey_report_object_size_summary {
      	size_t num_missing;
      };
      
    -+typedef int (*survey_top_size_cmp)(struct survey_report_object_size_summary *s1,
    -+				   struct survey_report_object_size_summary *s2);
    ++typedef int (*survey_top_cmp)(void *v1, void *v2);
     +
     +MAYBE_UNUSED
    -+static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    -+		     struct survey_report_object_size_summary *s2)
    ++static int cmp_by_nr(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->nr < s2->nr)
     +		return -1;
     +	if (s1->nr > s2->nr)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +}
     +
     +MAYBE_UNUSED
    -+static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    -+			    struct survey_report_object_size_summary *s2)
    ++static int cmp_by_disk_size(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->disk_size < s2->disk_size)
     +		return -1;
     +	if (s1->disk_size > s2->disk_size)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +}
     +
     +MAYBE_UNUSED
    -+static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    -+				struct survey_report_object_size_summary *s2)
    ++static int cmp_by_inflated_size(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->inflated_size < s2->inflated_size)
     +		return -1;
     +	if (s1->inflated_size > s2->inflated_size)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     + * inserting a new category, reorder the list and free the one that
     + * got ejected (if any).
     + */
    -+struct survey_report_top_sizes {
    ++struct survey_report_top_table {
     +	const char *name;
    -+	survey_top_size_cmp cmp_fn;
    -+	struct survey_report_object_size_summary *data;
    ++	survey_top_cmp cmp_fn;
     +	size_t nr;
     +	size_t alloc;
    ++
    ++	/**
    ++	 * 'data' stores an array of structs and must be cast into
    ++	 * the proper array type before evaluating an index.
    ++	 */
    ++	void *data;
     +};
     +
     +MAYBE_UNUSED
    -+static void init_top_sizes(struct survey_report_top_sizes *top,
    ++static void init_top_sizes(struct survey_report_top_table *top,
     +			   size_t limit, const char *name,
    -+			   survey_top_size_cmp cmp)
    ++			   survey_top_cmp cmp)
     +{
    ++	struct survey_report_object_size_summary *sz_array;
    ++
     +	top->name = name;
    ++	top->cmp_fn = cmp;
     +	top->alloc = limit;
     +	top->nr = 0;
    -+	CALLOC_ARRAY(top->data, limit);
    -+	top->cmp_fn = cmp;
    ++
    ++	CALLOC_ARRAY(sz_array, limit);
    ++	top->data = sz_array;
     +}
     +
     +MAYBE_UNUSED
    -+static void clear_top_sizes(struct survey_report_top_sizes *top)
    ++static void clear_top_sizes(struct survey_report_top_table *top)
     +{
    ++	struct survey_report_object_size_summary *sz_array = top->data;
    ++
     +	for (size_t i = 0; i < top->nr; i++)
    -+		free(top->data[i].label);
    -+	free(top->data);
    ++		free(sz_array[i].label);
    ++	free(sz_array);
     +}
     +
     +MAYBE_UNUSED
    -+static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    ++static void maybe_insert_into_top_size(struct survey_report_top_table *top,
     +				       struct survey_report_object_size_summary *summary)
     +{
    ++	struct survey_report_object_size_summary *sz_array = top->data;
     +	size_t pos = top->nr;
     +
     +	/* Compare against list from the bottom. */
    -+	while (pos > 0 && top->cmp_fn(&top->data[pos - 1], summary) < 0)
    ++	while (pos > 0 && top->cmp_fn(&sz_array[pos - 1], summary) < 0)
     +		pos--;
     +
     +	/* Not big enough! */
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +
     +	/* We need to shift the data. */
     +	if (top->nr == top->alloc)
    -+		free(top->data[top->nr - 1].label);
    ++		free(sz_array[top->nr - 1].label);
     +	else
     +		top->nr++;
     +
     +	for (size_t i = top->nr - 1; i > pos; i--)
    -+		memcpy(&top->data[i], &top->data[i - 1], sizeof(*top->data));
    ++		memcpy(&sz_array[i], &sz_array[i - 1], sizeof(*sz_array));
     +
    -+	memcpy(&top->data[pos], summary, sizeof(*summary));
    -+	top->data[pos].label = xstrdup(summary->label);
    ++	memcpy(&sz_array[pos], summary, sizeof(*summary));
    ++	sz_array[pos].label = xstrdup(summary->label);
     +}
     +
      /**
 8:  5eee0c9dff !  8:  13ccb6d3a2 survey: add report of "largest" paths
    @@ Commit message
     
      ## builtin/survey.c ##
     @@ builtin/survey.c: struct survey_report_object_size_summary {
    - 	size_t num_missing;
    - };
      
    --typedef int (*survey_top_size_cmp)(struct survey_report_object_size_summary *s1,
    --				   struct survey_report_object_size_summary *s2);
    -+typedef int (*survey_top_cmp)(void *v1,
    -+			      void *v2);
    + typedef int (*survey_top_cmp)(void *v1, void *v2);
      
     -MAYBE_UNUSED
    --static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    --		     struct survey_report_object_size_summary *s2)
    -+static int cmp_by_nr(void *v1, void *v2)
    + static int cmp_by_nr(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->nr < s2->nr)
    - 		return -1;
    - 	if (s1->nr > s2->nr)
    -@@ builtin/survey.c: static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: static int cmp_by_nr(void *v1, void *v2)
      	return 0;
      }
      
     -MAYBE_UNUSED
    --static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    --			    struct survey_report_object_size_summary *s2)
    -+static int cmp_by_disk_size(void *v1, void *v2)
    + static int cmp_by_disk_size(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->disk_size < s2->disk_size)
    - 		return -1;
    - 	if (s1->disk_size > s2->disk_size)
    -@@ builtin/survey.c: static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: static int cmp_by_disk_size(void *v1, void *v2)
      	return 0;
      }
      
     -MAYBE_UNUSED
    --static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    --				struct survey_report_object_size_summary *s2)
    -+static int cmp_by_inflated_size(void *v1, void *v2)
    + static int cmp_by_inflated_size(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->inflated_size < s2->inflated_size)
    - 		return -1;
    - 	if (s1->inflated_size > s2->inflated_size)
    -@@ builtin/survey.c: static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    -  * inserting a new category, reorder the list and free the one that
    -  * got ejected (if any).
    -  */
    --struct survey_report_top_sizes {
    -+struct survey_report_top_table {
    - 	const char *name;
    --	survey_top_size_cmp cmp_fn;
    --	struct survey_report_object_size_summary *data;
    -+	survey_top_cmp cmp_fn;
    - 	size_t nr;
    - 	size_t alloc;
    -+
    -+	/**
    -+	 * 'data' stores an array of structs and must be cast into
    -+	 * the proper array type before evaluating an index.
    -+	 */
    -+	void *data;
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: struct survey_report_top_table {
    + 	void *data;
      };
      
     -MAYBE_UNUSED
    --static void init_top_sizes(struct survey_report_top_sizes *top,
    -+static void init_top_sizes(struct survey_report_top_table *top,
    + static void init_top_sizes(struct survey_report_top_table *top,
      			   size_t limit, const char *name,
    --			   survey_top_size_cmp cmp)
    -+			   survey_top_cmp cmp)
    - {
    -+	struct survey_report_object_size_summary *sz_array;
    -+
    - 	top->name = name;
    -+	top->cmp_fn = cmp;
    - 	top->alloc = limit;
    - 	top->nr = 0;
    --	CALLOC_ARRAY(top->data, limit);
    --	top->cmp_fn = cmp;
    -+
    -+	CALLOC_ARRAY(sz_array, limit);
    -+	top->data = sz_array;
    - }
    - 
    - MAYBE_UNUSED
    --static void clear_top_sizes(struct survey_report_top_sizes *top)
    -+static void clear_top_sizes(struct survey_report_top_table *top)
    - {
    -+	struct survey_report_object_size_summary *sz_array = top->data;
    - 	for (size_t i = 0; i < top->nr; i++)
    --		free(top->data[i].label);
    -+		free(sz_array[i].label);
    - 	free(top->data);
    + 			   survey_top_cmp cmp)
    +@@ builtin/survey.c: static void clear_top_sizes(struct survey_report_top_table *top)
    + 	free(sz_array);
      }
      
     -MAYBE_UNUSED
    --static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    -+static void maybe_insert_into_top_size(struct survey_report_top_table *top,
    + static void maybe_insert_into_top_size(struct survey_report_top_table *top,
      				       struct survey_report_object_size_summary *summary)
      {
    -+	struct survey_report_object_size_summary *sz_array = top->data;
    - 	size_t pos = top->nr;
    - 
    - 	/* Compare against list from the bottom. */
    --	while (pos > 0 && top->cmp_fn(&top->data[pos - 1], summary) < 0)
    -+	while (pos > 0 && top->cmp_fn(&sz_array[pos - 1], summary) < 0)
    - 		pos--;
    - 
    - 	/* Not big enough! */
    -@@ builtin/survey.c: static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    - 
    - 	/* We need to shift the data. */
    - 	if (top->nr == top->alloc)
    --		free(top->data[top->nr - 1].label);
    -+		free(sz_array[top->nr - 1].label);
    - 	else
    - 		top->nr++;
    - 
    - 	for (size_t i = top->nr - 1; i > pos; i--)
    --		memcpy(&top->data[i], &top->data[i - 1], sizeof(*top->data));
    -+		memcpy(&sz_array[i], &sz_array[i - 1], sizeof(*sz_array));
    - 
    --	memcpy(&top->data[pos], summary, sizeof(*summary));
    --	top->data[pos].label = xstrdup(summary->label);
    -+	memcpy(&sz_array[pos], summary, sizeof(*summary));
    -+	sz_array[pos].label = xstrdup(summary->label);
    - }
    - 
    - /**
     @@ builtin/survey.c: struct survey_report {
      	struct survey_report_object_summary reachable_objects;
      
    @@ builtin/survey.c: static int survey_objects_path_walk_fn(const char *path,
     
      ## t/t8100-git-survey.sh ##
     @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
    - 	       Tags |     4 |       510 |           528
    + 	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
      	EOF
      
     -	test_cmp expect out
 9:  dac3510837 =  9:  f9351d993a survey: add --top=<N> option and config
10:  0071e3c0c1 <  -:  ---------- fixup! survey: add object count summary
11:  4e69b52010 <  -:  ---------- fixup! survey: add command line opts to select references
12:  4e9fddcb96 <  -:  ---------- fixup! survey: summarize total sizes by object type
13:  1d11819bc3 <  -:  ---------- fixup! survey: summarize total sizes by object type
14:  97efdc3e02 <  -:  ---------- fixup! survey: add object count summary
15:  269b053d0a <  -:  ---------- fixup! survey: start pretty printing data in table form
16:  9378ca9664 ! 10:  bce8ef63c8 survey: clearly note the experimental nature in the output
    @@ builtin/survey.c: int cmd_survey(int argc, const char **argv, const char *prefix
      	if (argc == 2 && !strcmp(argv[1], "-h"))
      		usage_with_options(survey_usage, survey_options);
      
    -+	color_fprintf_ln(stderr,
    -+			 want_color_fd(2, GIT_COLOR_AUTO) ? GIT_COLOR_YELLOW : "",
    -+			 "(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)");
    ++	if (isatty(2))
    ++		color_fprintf_ln(stderr,
    ++				 want_color_fd(2, GIT_COLOR_AUTO) ? GIT_COLOR_YELLOW : "",
    ++				 "(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)");
     +
      	ctx.repo = the_repository;
      
17:  5d4ed3b475 <  -:  ---------- fixup! survey: clearly note the experimental nature in the output
18:  e42c3e6abb <  -:  ---------- fixup! survey: show progress during object walk

When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concrete data.

Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start.

The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
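The two-pass idea described above (measure every column first, then print) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in builtin/survey.c: the `demo_table` struct, the `demo_*` function names, and the fixed `DEMO_MAX_*` limits are all hypothetical, and cells are right-aligned only because the sample output in this PR appears to be.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_COLS 8
#define DEMO_MAX_ROWS 16

/* Hypothetical stand-in for the survey table: a title, a header, string rows. */
struct demo_table {
	const char *title;
	size_t nr_cols;
	size_t nr_rows;
	const char *header[DEMO_MAX_COLS];
	const char *rows[DEMO_MAX_ROWS][DEMO_MAX_COLS];
};

/* First pass: record the widest cell in each column, header included. */
static void demo_column_widths(const struct demo_table *t, size_t *width)
{
	for (size_t c = 0; c < t->nr_cols; c++) {
		width[c] = strlen(t->header[c]);
		for (size_t r = 0; r < t->nr_rows; r++) {
			size_t len = strlen(t->rows[r][c]);
			if (len > width[c])
				width[c] = len;
		}
	}
}

/* Print one row, right-aligning each cell to its column width. */
static void demo_print_row(FILE *out, const char *const *cells,
			   const size_t *width, size_t nr_cols)
{
	for (size_t c = 0; c < nr_cols; c++)
		fprintf(out, "%s%*s", c ? " | " : "", (int)width[c], cells[c]);
	fputc('\n', out);
}

/* Second pass: title, underline, header, separator, then the data rows. */
static void demo_print_table(FILE *out, const struct demo_table *t)
{
	size_t width[DEMO_MAX_COLS];

	assert(t->nr_cols > 0 && t->nr_cols <= DEMO_MAX_COLS);
	assert(t->nr_rows <= DEMO_MAX_ROWS);
	demo_column_widths(t, width);

	fprintf(out, "%s\n", t->title);
	for (size_t i = strlen(t->title); i > 0; i--)
		fputc('=', out);
	fputc('\n', out);

	demo_print_row(out, t->header, width, t->nr_cols);
	for (size_t c = 0; c < t->nr_cols; c++) {
		if (c)
			fputs("-+-", out);
		for (size_t i = 0; i < width[c]; i++)
			fputc('-', out);
	}
	fputc('\n', out);

	for (size_t r = 0; r < t->nr_rows; r++)
		demo_print_row(out, t->rows[r], width, t->nr_cols);
}
```

Because every cell is already a string, the same printer serves count tables and size tables alike; only the code that formats numbers into strings would differ per report.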

At the moment, the reason for using the path-walk API is not obvious, but it will become more apparent in future iterations. For now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing and why certain operations are slow because the inflated size of the abstract objects that must be processed is so large.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Signed-off-by: Derrick Stolee <stolee@gmail.com>

In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
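A constant-size "top N" table of this kind can be sketched as below. It follows the shape of `maybe_insert_into_top_size()` visible in the range-diff above, but it is a simplified, self-contained illustration: the `demo_*` names and structs are hypothetical, a plain numeric comparison stands in for the `cmp_fn` callback, and `malloc`-based helpers replace Git's `xstrdup()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row: an owned label (e.g. a path) and the metric being ranked. */
struct demo_entry {
	char *label;
	unsigned long value;
};

/* Fixed-capacity table kept sorted from largest to smallest value. */
struct demo_top {
	struct demo_entry *data;
	size_t nr, alloc;
};

static void demo_top_init(struct demo_top *top, size_t limit)
{
	top->data = calloc(limit, sizeof(*top->data));
	assert(top->data || !limit);
	top->nr = 0;
	top->alloc = limit;
}

static char *demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	assert(copy);
	memcpy(copy, s, len);
	return copy;
}

static void demo_top_insert(struct demo_top *top, const char *label,
			    unsigned long value)
{
	size_t pos = top->nr;

	/* Compare against the list from the bottom. */
	while (pos > 0 && top->data[pos - 1].value < value)
		pos--;

	/* Not big enough to enter a full table. */
	if (pos >= top->alloc)
		return;

	/* A full table drops its last row; otherwise the table grows by one. */
	if (top->nr == top->alloc)
		free(top->data[top->nr - 1].label);
	else
		top->nr++;

	/* Shift lower-ranked rows down to open a slot at pos. */
	for (size_t i = top->nr - 1; i > pos; i--)
		top->data[i] = top->data[i - 1];

	top->data[pos].label = demo_strdup(label);
	top->data[pos].value = value;
}

static void demo_top_clear(struct demo_top *top)
{
	for (size_t i = 0; i < top->nr; i++)
		free(top->data[i].label);
	free(top->data);
	memset(top, 0, sizeof(*top));
}
```

Each insertion costs at most N comparisons and N row moves, and memory stays bounded by N rows no matter how many paths the walk visits, which is the trade-off the commit message describes over storing everything and sorting at the end.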

Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the total inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
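The three per-path metrics listed above can be illustrated with a small sketch. It assumes the walk hands over all object versions found at one path as a batch; the `demo_object` and `demo_path_summary` structs and the comparator names are hypothetical and do not mirror the real path-walk callback signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-object sizes, as a batch at one path might provide them. */
struct demo_object {
	unsigned long disk_size;     /* bytes as stored (delta + zlib) */
	unsigned long inflated_size; /* bytes of the full object content */
};

/* The three metrics tracked per path. */
struct demo_path_summary {
	const char *path;
	size_t nr;                   /* number of versions */
	unsigned long disk_size;     /* total size on disk */
	unsigned long inflated_size; /* total inflated size */
};

/* Fold one batch of objects sharing a path into a single summary row. */
static struct demo_path_summary demo_summarize_path(const char *path,
						    const struct demo_object *objs,
						    size_t nr)
{
	struct demo_path_summary s = { path, 0, 0, 0 };

	assert(objs || !nr);
	for (size_t i = 0; i < nr; i++) {
		s.nr++;
		s.disk_size += objs[i].disk_size;
		s.inflated_size += objs[i].inflated_size;
	}
	return s;
}

/* One comparator per metric; each returns <0, 0 or >0 like strcmp(). */
static int demo_cmp_by_nr(const struct demo_path_summary *a,
			  const struct demo_path_summary *b)
{
	return (a->nr > b->nr) - (a->nr < b->nr);
}

static int demo_cmp_by_disk_size(const struct demo_path_summary *a,
				 const struct demo_path_summary *b)
{
	return (a->disk_size > b->disk_size) - (a->disk_size < b->disk_size);
}

static int demo_cmp_by_inflated_size(const struct demo_path_summary *a,
				     const struct demo_path_summary *b)
{
	return (a->inflated_size > b->inflated_size) -
	       (a->inflated_size < b->inflated_size);
}
```

Keeping one comparator per metric lets the same summary row be offered to several top-N tables, so a path with many small versions and a path with one large binary can each surface in the table where they matter.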

The 'git survey' builtin provides several detail tables, such as "top files by on-disk size". The size of these tables currently defaults to 100. Allow the user to specify this number via a new --top=<N> option or the new survey.top config key.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

While this command is definitely something we _want_, chances are that upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus on what we need out of this command: To assist with diagnosing issues with large repositories, as well as to help monitor the growth and the associated pain points of such repositories.

To that end, we are about to integrate this command into `microsoft/git`, to get the tool into the hands of users who need it most, with the idea to iterate in close collaboration between these users and the developers familiar with Git's internals.

However, we will definitely want to avoid letting anybody have the impression that this command, its exact inner workings, as well as its output format, are anywhere close to stable. To make that fact utterly clear (and thereby protect the freedom to iterate and innovate freely before upstreaming the command), let's mark its output as experimental in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

dscho · 2024-09-26T18:25:49Z

/add relnote feature The new, experimental git survey command was added. This command is designed to help identify less-than-ideal data shape in monorepos, and it will likely see highly active development. Stay tuned!

The workflow run was started

The new, experimental [`git survey` command was added](git-for-windows/git#5174). This command is designed to help identify less-than-ideal data shape in monorepos, and it will likely see highly active development. Stay tuned! Signed-off-by: gitforwindowshelper[bot] <gitforwindowshelper-bot@users.noreply.github.com>

@jeffhostetler

This introduces `git survey` to Git for Windows ahead of upstream for the express purpose of getting the path-based analysis in the hands of more folks.

The inspiration of this builtin is [`git-sizer`](https://github.com/github/git-sizer), but since that command relies on `git cat-file --batch` to get the contents of objects, it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced into the `microsoft/git` fork in microsoft#667. That version had a lot more bells and whistles, including an analysis much closer to what `git-sizer` provides.

The biggest difference in this version is that this one is focused on using the path-walk API in order to visit batches of objects based on a common path. This allows identifying, for instance, the path that is contributing the most to the on-disk size across all versions at that path.

For example, here are the top ten paths contributing to my local Git repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for growth in a few internal monorepos. Those findings motivated the changes in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of the experimental tool in advance of it being contributed to the upstream project.

Unfortunately, this will mean that in the next `microsoft/git` rebase, @jeffhostetler's version will need to be pulled out since there are enough conflicts. These conflicts include how tables are stored and generated, as the version in this PR is slightly more general to allow for different kinds of data.

derrickstolee requested a review from dscho September 26, 2024 11:30

derrickstolee self-assigned this Sep 26, 2024

dscho approved these changes Sep 26, 2024

derrickstolee commented Sep 26, 2024

t/t8100-git-survey.sh (resolved)

derrickstolee commented Sep 26, 2024

t/t8100-git-survey.sh (resolved)

jeffhostetler and others added 2 commits September 26, 2024 12:06

derrickstolee force-pushed the survey-on-full branch from d57a2fa to bce8ef6 Compare September 26, 2024 16:27

derrickstolee and others added 8 commits September 26, 2024 13:25

survey: show progress during object walk

4e42826

Signed-off-by: Derrick Stolee <stolee@gmail.com>

derrickstolee force-pushed the survey-on-full branch from bce8ef6 to d28dc5b Compare September 26, 2024 17:25

dscho added this to the Next release milestone Sep 26, 2024

dscho merged commit 68f029a into git-for-windows:main Sep 26, 2024
44 checks passed

dscho mentioned this pull request Sep 28, 2024

[WORK-IN-PROGRESS] Introduce the path walk API into Git for Windows #5146

Closed
