Add experimental 'git survey' builtin #5174

Merged
merged 10 commits into from
Sep 26, 2024


derrickstolee

This introduces git survey to Git for Windows ahead of upstream, for the express purpose of getting the path-based analysis into the hands of more folks.

The inspiration for this builtin is git-sizer, but because that command relies on git cat-file --batch to read the contents of objects, it is limited in how much information it can provide.

This is mostly a rewrite of the git survey builtin that was introduced into the microsoft/git fork in microsoft#667. That version had a lot more bells and whistles, including an analysis much closer to what git-sizer provides.

The biggest difference is that this version focuses on using the path-walk API to visit batches of objects that share a common path. This makes it possible to identify, for instance, the path that contributes the most to the on-disk size across all versions at that path.

For example, here are the top ten paths contributing to my local Git repository (which includes microsoft/git and gitster/git):

TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603

This kind of analysis has been helpful in identifying the reasons for growth in a few internal monorepos. Those findings motivated the changes in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of the experimental tool in advance of it being contributed to the upstream project.

Unfortunately, this will mean that in the next microsoft/git rebase, @jeffhostetler's version will need to be pulled out since there are enough conflicts. These conflicts include how tables are stored and generated, as the version in this PR is slightly more general to allow for different kinds of data.

@derrickstolee derrickstolee self-assigned this Sep 26, 2024
@derrickstolee
Author

@dscho: The whitespace break is intentional due to the way that the output is generated (there are 8 or more spaces in the left padding of the output). Is there a way to ignore that failure?

@dscho
Member

dscho commented Sep 26, 2024

> The whitespace break is intentional due to the way that the output is generated (there are 8 or more spaces in the left padding of the output). Is there a way to ignore that failure?

@derrickstolee yes, there are a couple of ways. 0071e3c works, even if it is ugly.

An alternative would have been to pull out the big hammer:

diff --git a/t/.gitattributes b/t/.gitattributes
index 7664c6e027d..7e9356613b5 100644
--- a/t/.gitattributes
+++ b/t/.gitattributes
@@ -23,3 +23,4 @@ t[0-9][0-9][0-9][0-9]/* -whitespace
 /t8005/*.txt eol=lf
 /t9*/*.dump eol=lf
 /t0040*.sh whitespace=-indent-with-non-tab
+/t8100*.sh whitespace=-indent-with-spaces

I do not like this because it affects the entire file, not just those two lines where we need an exemption.

@dscho
Member

dscho commented Sep 26, 2024

> This is mostly a rewrite of the git survey builtin that was introduced into the microsoft/git fork in microsoft#667. That version had a lot more bells and whistles, including an analysis much closer to what git-sizer provides.
>
> [...]
>
> Unfortunately, this will mean that in the next microsoft/git rebase, @jeffhostetler's version will need to be pulled out since there are enough conflicts. These conflicts include how tables are stored and generated, as the version in this PR is slightly more general to allow for different kinds of data.

I hope to be able to remedy that at least to some extent. One of the biggest benefits of @jeffhostetler's version of git survey was that it provided functionality comparable to git-sizer without the multi-megabyte penalty (which is the primary reason why git-sizer is not distributed alongside Git for Windows). It might take until -rc1 for me to find the time, though.

Member

@dscho dscho left a comment


🎉

I'd like to squash in the fixup!s, but any other suggestion I had is not really important enough to hold this PR up.

builtin/survey.c Outdated
Comment on lines 122 to 136
	static struct strbuf lines = STRBUF_INIT;
	size_t width = 0;
	strbuf_setlen(&lines, 0);

	strbuf_addch(&lines, ' ');
	strbuf_addstr(&lines, name);
	strbuf_addch(&lines, '\n');

	for (size_t i = 0; i < nr; i++) {
		if (i)
			width += 3;
		width += widths[i];
	}
	strbuf_addchars(&lines, '=', width);
	printf("%s\n", lines.buf);
Member


I feel like this can be simplified dramatically, but it may not be worth spending too many cycles on this. But if you like this, too, @derrickstolee, I would love to make it so:

const char *e = "======================================================================"
        "==============================================================================";
size_t width = (nr - 1) * 3;

for (size_t i = 0; i < nr; i++)
    width += widths[i];

printf(" %s\n%.*s\n", name, (int)width, e);

Granted, it does not scale to insanely long widths, but it is so much simpler.

On the other hand, I think this entire table printing business could do with a context that contains a strbuf scratch buffer with which to work. I guess it's not the most pressing issue right now to simplify this.

Author


I do like the width = (nr - 1) * 3; for sure. If the printf magic works, then excellent.

Member


This works, but it's kind of ugly. I think I'd like to leave this for later, after all:

diff --git a/builtin/survey.c b/builtin/survey.c
index 37296a5feb7..2c86d78d8a6 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -270,27 +270,16 @@ static void insert_table_rowv(struct survey_table *table, ...)
 
 static void print_table_title(const char *name, size_t *widths, size_t nr)
 {
-	static struct strbuf lines = STRBUF_INIT;
-	size_t width = 0;
-	size_t min_width;
-	strbuf_setlen(&lines, 0);
+	const char *e = "======================================================"
+		"=============================================================="
+		"=============================================================";
+	size_t width = (nr - 1) * 3, min_width = strlen(name);
 
-	strbuf_addch(&lines, '\n');
-	strbuf_addstr(&lines, name);
-	min_width = lines.len - 1;
-	strbuf_addch(&lines, '\n');
-
-	for (size_t i = 0; i < nr; i++) {
-		if (i)
-			width += 3;
+	for (size_t i = 0; i < nr; i++)
 		width += widths[i];
-	}
-
-	if (width < min_width)
-		width = min_width;
 
-	strbuf_addchars(&lines, '=', width);
-	printf("%s\n", lines.buf);
+	printf("\n%s\n%.*s\n", name,
+	       (int)(min_width > width ? min_width : width), e);
 }
 
 static void print_row_plaintext(struct strvec *row, size_t *widths)

builtin/survey.c Outdated
Comment on lines 126 to 144
strbuf_addch(&lines, ' ');
strbuf_addch(&lines, '\n');
Member


This hunk probably wants to be squashed into the preceding commit.

jeffhostetler and others added 2 commits September 26, 2024 12:06
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from Go to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>

By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts to let the user ask for all refs or a subset of
them and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee
Author

Latest push should handle all of the review comments. Here's the range diff from @dscho's push with fixups:

Range diff
 1:  9751447967 =  1:  3a8cd93b06 survey: stub in new experimental 'git-survey' command
 2:  cd7bdb7046 !  2:  c08fa91a24 survey: add command line opts to select references
    @@ builtin/survey.c: static void survey_load_config(struct survey_context *ctx)
     +
     +	ctx->report.refs.refs_nr = ref_array.nr;
     +	for (size_t i = 0; i < ref_array.nr; i++) {
    -+		size_t size;
    ++		unsigned long size;
     +		struct ref_array_item *item = ref_array.items[i];
     +
     +		switch (item->kind) {
 3:  672c3e6bda !  3:  fe8544c4c8 survey: start pretty printing data in table form
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +	table->rows_nr++;
     +}
     +
    ++#define SECTION_SEGMENT "========================================"
    ++#define SECTION_SEGMENT_LEN 40
    ++const char *section_line = SECTION_SEGMENT
    ++			   SECTION_SEGMENT
    ++			   SECTION_SEGMENT
    ++			   SECTION_SEGMENT;
    ++const size_t section_len = 4 * SECTION_SEGMENT_LEN;
    ++
     +static void print_table_title(const char *name, size_t *widths, size_t nr)
     +{
    -+	static struct strbuf lines = STRBUF_INIT;
    -+	size_t width = 0;
    -+	strbuf_setlen(&lines, 0);
    -+
    -+	strbuf_addch(&lines, ' ');
    -+	strbuf_addstr(&lines, name);
    -+	strbuf_addch(&lines, '\n');
    ++	size_t width = 3 * (nr - 1);
     +
    -+	for (size_t i = 0; i < nr; i++) {
    -+		if (i)
    -+			width += 3;
    ++	for (size_t i = 0; i < nr; i++)
     +		width += widths[i];
    -+	}
    -+	strbuf_addchars(&lines, '=', width);
    -+	printf("%s\n", lines.buf);
    ++
    ++	if (width > section_len)
    ++		width = section_len;
    ++
    ++	printf("\n%s\n%.*s\n", name, (int)width, section_line);
     +}
     +
     +static void print_row_plaintext(struct strvec *row, size_t *widths)
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +
     +	for (size_t j = 0; j < table->rows_nr; j++)
     +		print_row_plaintext(&table->rows[j], column_widths);
    ++
    ++	free(column_widths);
     +}
     +
     +static void survey_report_plaintext_refs(struct survey_context *ctx)
    @@ builtin/survey.c: static void clear_survey_context(struct survey_context *ctx)
     +	strvec_push(&table.header, _("Count"));
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_branches) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->branches_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->branches_nr);
     +		insert_table_rowv(&table, _("Branches"), fmt, NULL);
     +		free(fmt);
     +	}
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_remotes) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->remote_refs_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->remote_refs_nr);
     +		insert_table_rowv(&table, _("Remote refs"), fmt, NULL);
     +		free(fmt);
     +	}
     +
     +	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_tags) {
    -+		char *fmt = xstrfmt("%"PRIuMAX"", refs->tags_nr);
    ++		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_nr);
     +		insert_table_rowv(&table, _("Tags (all)"), fmt, NULL);
     +		free(fmt);
    -+		fmt = xstrfmt("%"PRIuMAX"", refs->tags_annotated_nr);
    ++		fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_annotated_nr);
     +		insert_table_rowv(&table, _("Tags (annotated)"), fmt, NULL);
     +		free(fmt);
     +	}
    @@ t/t8100-git-survey.sh: test_expect_success 'create a semi-interesting repo' '
     -	test_line_count = 0 err
     +	test_line_count = 0 err &&
     +
    -+	cat >expect <<-EOF &&
    ++	tr , " " >expect <<-EOF &&
     +	GIT SURVEY for "$(pwd)"
     +	-----------------------------------------------------
    -+	 REFERENCES SUMMARY
    ++
    ++	REFERENCES SUMMARY
     +	========================
    -+		Ref Type | Count
    ++	,       Ref Type | Count
     +	-----------------+------
    -+		Branches |     1
    ++	,       Branches |     1
     +	     Remote refs |     0
     +	      Tags (all) |     0
     +	Tags (annotated) |     0
 4:  846c8a69f7 !  4:  9ab09921d5 survey: add object count summary
    @@ builtin/survey.c: struct survey_context {
      	strvec_clear(&ctx->refs);
      }
      
    -@@ builtin/survey.c: static void print_table_title(const char *name, size_t *widths, size_t nr)
    +@@ builtin/survey.c: const size_t section_len = 4 * SECTION_SEGMENT_LEN;
    + static void print_table_title(const char *name, size_t *widths, size_t nr)
      {
    - 	static struct strbuf lines = STRBUF_INIT;
    - 	size_t width = 0;
    -+	size_t min_width;
    - 	strbuf_setlen(&lines, 0);
    + 	size_t width = 3 * (nr - 1);
    ++	size_t min_width = strlen(name);
      
    --	strbuf_addch(&lines, ' ');
    -+	strbuf_addch(&lines, '\n');
    - 	strbuf_addstr(&lines, name);
    -+	min_width = lines.len - 1;
    - 	strbuf_addch(&lines, '\n');
    - 
    - 	for (size_t i = 0; i < nr; i++) {
    -@@ builtin/survey.c: static void print_table_title(const char *name, size_t *widths, size_t nr)
    - 			width += 3;
    + 	for (size_t i = 0; i < nr; i++)
      		width += widths[i];
    - 	}
    -+
    + 
     +	if (width < min_width)
     +		width = min_width;
     +
    - 	strbuf_addchars(&lines, '=', width);
    - 	printf("%s\n", lines.buf);
    - }
    + 	if (width > section_len)
    + 		width = section_len;
    + 
     @@ builtin/survey.c: static void survey_report_plaintext_refs(struct survey_context *ctx)
      	clear_table(&table);
      }
    @@ builtin/survey.c: static void survey_report_plaintext_refs(struct survey_context
     +	strvec_push(&table.header, _("Object Type"));
     +	strvec_push(&table.header, _("Count"));
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->tags_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->tags_nr);
     +	insert_table_rowv(&table, _("Tags"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->commits_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->commits_nr);
     +	insert_table_rowv(&table, _("Commits"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->trees_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->trees_nr);
     +	insert_table_rowv(&table, _("Trees"), fmt, NULL);
     +	free(fmt);
     +
    -+	fmt = xstrfmt("%"PRIuMAX"", objs->blobs_nr);
    ++	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->blobs_nr);
     +	insert_table_rowv(&table, _("Blobs"), fmt, NULL);
     +	free(fmt);
     +
    @@ builtin/survey.c: static void do_load_refs(struct survey_context *ctx,
     -	for (size_t i = 0; i < ref_array.nr; i++) {
     +	ctx->report.refs.refs_nr = ctx->ref_array.nr;
     +	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
    - 		size_t size;
    + 		unsigned long size;
     -		struct ref_array_item *item = ref_array.items[i];
     +		struct ref_array_item *item = ctx->ref_array.items[i];
      
    @@ builtin/survey.c: static void survey_phase_refs(struct survey_context *ctx)
      
      	trace2_region_leave("survey", "phase/refs", ctx->repo);
     +}
    - 
    --	ref_array_clear(&ref_array);
    ++
     +static void increment_object_counts(
     +		struct survey_report_object_summary *summary,
     +		enum object_type type,
    @@ builtin/survey.c: static void survey_phase_refs(struct survey_context *ctx)
     +	case OBJ_COMMIT:
     +		summary->commits_nr += nr;
     +		break;
    -+
    + 
    +-	ref_array_clear(&ref_array);
     +	case OBJ_TREE:
     +		summary->trees_nr += nr;
     +		break;
    @@ t/t8100-git-survey.sh: test_expect_success 'git survey -h shows experimental war
     +	git survey --all-refs >out 2>err &&
      	test_line_count = 0 err &&
      
    - 	cat >expect <<-EOF &&
    - 	GIT SURVEY for "$(pwd)"
    - 	-----------------------------------------------------
    --	 REFERENCES SUMMARY
    -+
    -+	REFERENCES SUMMARY
    - 	========================
    --		Ref Type | Count
    -+	        Ref Type | Count
    + 	tr , " " >expect <<-EOF &&
    +@@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
      	-----------------+------
    --		Branches |     1
    -+	        Branches |     1
    + 	,       Branches |     1
      	     Remote refs |     0
     -	      Tags (all) |     0
     -	Tags (annotated) |     0
 5:  7fc3fc8022 !  5:  f4fde444fd survey: summarize total sizes by object type
    @@ builtin/survey.c: static void survey_report_plaintext_reachable_object_summary(s
     +	struct survey_table table = SURVEY_TABLE_INIT;
     +	table.table_name = title;
     +
    -+	strvec_push(&table.header, xstrdup(categories));
    -+	strvec_push(&table.header, xstrdup(_("Count")));
    -+	strvec_push(&table.header, xstrdup(_("Disk Size")));
    -+	strvec_push(&table.header, xstrdup(_("Inflated Size")));
    ++	strvec_push(&table.header, categories);
    ++	strvec_push(&table.header, _("Count"));
    ++	strvec_push(&table.header, _("Disk Size"));
    ++	strvec_push(&table.header, _("Inflated Size"));
     +
     +	for (size_t i = 0; i < summary_nr; i++) {
    -+		insert_table_rowv(&table, xstrdup(summary[i].label),
    -+				  xstrfmt("%"PRIuMAX, summary[i].nr),
    -+				  xstrfmt("%"PRIuMAX, summary[i].disk_size),
    -+				  xstrfmt("%"PRIuMAX, summary[i].inflated_size),
    -+				  NULL);
    ++		char *label_str =  xstrdup(summary[i].label);
    ++		char *nr_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].nr);
    ++		char *disk_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].disk_size);
    ++		char *inflate_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].inflated_size);
    ++
    ++		insert_table_rowv(&table, label_str, nr_str,
    ++				  disk_str, inflate_str, NULL);
    ++
    ++		free(label_str);
    ++		free(nr_str);
    ++		free(disk_str);
    ++		free(inflate_str);
     +	}
     +
     +	print_table_plaintext(&table);
    @@ builtin/survey.c: static void survey_phase_objects(struct survey_context *ctx)
      	walk_objects_by_path(&info);
     
      ## t/t8100-git-survey.sh ##
    +@@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
    + 	git survey --all-refs >out 2>err &&
    + 	test_line_count = 0 err &&
    + 
    ++	test_oid_cache <<-EOF &&
    ++	commits_size_on_disk sha1:     1523
    ++	commits_size_on_disk sha256:     1811
    ++
    ++	commits_size sha1:         2153
    ++	commits_size sha256:         2609
    ++
    ++	trees_size_on_disk sha1:      495
    ++	trees_size_on_disk sha256:      635
    ++
    ++	trees_size sha1:         1706
    ++	trees_size sha256:         2366
    ++
    ++	tags_size sha1:          528
    ++	tags_size sha256:          624
    ++
    ++	tags_size_on_disk sha1:      510
    ++	tags_size_on_disk sha256:      569
    ++	EOF
    ++
    + 	tr , " " >expect <<-EOF &&
    + 	GIT SURVEY for "$(pwd)"
    + 	-----------------------------------------------------
     @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
      	    Commits |    10
      	      Trees |    10
    @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
     +	===============================================
     +	Object Type | Count | Disk Size | Inflated Size
     +	------------+-------+-----------+--------------
    -+	    Commits |    10 |      1523 |          2153
    -+	      Trees |    10 |       495 |          1706
    ++	    Commits |    10 | $(test_oid commits_size_on_disk) | $(test_oid commits_size)
    ++	      Trees |    10 | $(test_oid trees_size_on_disk) | $(test_oid trees_size)
     +	      Blobs |    10 |       191 |           101
    -+	       Tags |     4 |       510 |           528
    ++	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
      	EOF
      
      	test_cmp expect out
 6:  448892d211 !  6:  21f75ebcf2 survey: show progress during object walk
    @@ t/t8100-git-survey.sh: test_expect_success 'create a semi-interesting repo' '
      '
      
     +test_expect_success 'git survey --progress' '
    -+	GIT_PROGRESS_DELATY=0 git survey --all-refs --progress >out 2>err &&
    ++	GIT_PROGRESS_DELAY=0 git survey --all-refs --progress >out 2>err &&
     +	grep "Preparing object walk" err
     +'
     +
 7:  d886c14550 !  7:  b6dd26b629 survey: add ability to track prioritized lists
    @@ builtin/survey.c: struct survey_report_object_size_summary {
      	size_t num_missing;
      };
      
    -+typedef int (*survey_top_size_cmp)(struct survey_report_object_size_summary *s1,
    -+				   struct survey_report_object_size_summary *s2);
    ++typedef int (*survey_top_cmp)(void *v1, void *v2);
     +
     +MAYBE_UNUSED
    -+static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    -+		     struct survey_report_object_size_summary *s2)
    ++static int cmp_by_nr(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->nr < s2->nr)
     +		return -1;
     +	if (s1->nr > s2->nr)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +}
     +
     +MAYBE_UNUSED
    -+static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    -+			    struct survey_report_object_size_summary *s2)
    ++static int cmp_by_disk_size(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->disk_size < s2->disk_size)
     +		return -1;
     +	if (s1->disk_size > s2->disk_size)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +}
     +
     +MAYBE_UNUSED
    -+static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    -+				struct survey_report_object_size_summary *s2)
    ++static int cmp_by_inflated_size(void *v1, void *v2)
     +{
    ++	struct survey_report_object_size_summary *s1 = v1;
    ++	struct survey_report_object_size_summary *s2 = v2;
    ++
     +	if (s1->inflated_size < s2->inflated_size)
     +		return -1;
     +	if (s1->inflated_size > s2->inflated_size)
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     + * inserting a new category, reorder the list and free the one that
     + * got ejected (if any).
     + */
    -+struct survey_report_top_sizes {
    ++struct survey_report_top_table {
     +	const char *name;
    -+	survey_top_size_cmp cmp_fn;
    -+	struct survey_report_object_size_summary *data;
    ++	survey_top_cmp cmp_fn;
     +	size_t nr;
     +	size_t alloc;
    ++
    ++	/**
    ++	 * 'data' stores an array of structs and must be cast into
    ++	 * the proper array type before evaluating an index.
    ++	 */
    ++	void *data;
     +};
     +
     +MAYBE_UNUSED
    -+static void init_top_sizes(struct survey_report_top_sizes *top,
    ++static void init_top_sizes(struct survey_report_top_table *top,
     +			   size_t limit, const char *name,
    -+			   survey_top_size_cmp cmp)
    ++			   survey_top_cmp cmp)
     +{
    ++	struct survey_report_object_size_summary *sz_array;
    ++
     +	top->name = name;
    ++	top->cmp_fn = cmp;
     +	top->alloc = limit;
     +	top->nr = 0;
    -+	CALLOC_ARRAY(top->data, limit);
    -+	top->cmp_fn = cmp;
    ++
    ++	CALLOC_ARRAY(sz_array, limit);
    ++	top->data = sz_array;
     +}
     +
     +MAYBE_UNUSED
    -+static void clear_top_sizes(struct survey_report_top_sizes *top)
    ++static void clear_top_sizes(struct survey_report_top_table *top)
     +{
    ++	struct survey_report_object_size_summary *sz_array = top->data;
    ++
     +	for (size_t i = 0; i < top->nr; i++)
    -+		free(top->data[i].label);
    -+	free(top->data);
    ++		free(sz_array[i].label);
    ++	free(sz_array);
     +}
     +
     +MAYBE_UNUSED
    -+static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    ++static void maybe_insert_into_top_size(struct survey_report_top_table *top,
     +				       struct survey_report_object_size_summary *summary)
     +{
    ++	struct survey_report_object_size_summary *sz_array = top->data;
     +	size_t pos = top->nr;
     +
     +	/* Compare against list from the bottom. */
    -+	while (pos > 0 && top->cmp_fn(&top->data[pos - 1], summary) < 0)
    ++	while (pos > 0 && top->cmp_fn(&sz_array[pos - 1], summary) < 0)
     +		pos--;
     +
     +	/* Not big enough! */
    @@ builtin/survey.c: struct survey_report_object_size_summary {
     +
     +	/* We need to shift the data. */
     +	if (top->nr == top->alloc)
    -+		free(top->data[top->nr - 1].label);
    ++		free(sz_array[top->nr - 1].label);
     +	else
     +		top->nr++;
     +
     +	for (size_t i = top->nr - 1; i > pos; i--)
    -+		memcpy(&top->data[i], &top->data[i - 1], sizeof(*top->data));
    ++		memcpy(&sz_array[i], &sz_array[i - 1], sizeof(*sz_array));
     +
    -+	memcpy(&top->data[pos], summary, sizeof(*summary));
    -+	top->data[pos].label = xstrdup(summary->label);
    ++	memcpy(&sz_array[pos], summary, sizeof(*summary));
    ++	sz_array[pos].label = xstrdup(summary->label);
     +}
     +
      /**
 8:  5eee0c9dff !  8:  13ccb6d3a2 survey: add report of "largest" paths
    @@ Commit message
     
      ## builtin/survey.c ##
     @@ builtin/survey.c: struct survey_report_object_size_summary {
    - 	size_t num_missing;
    - };
      
    --typedef int (*survey_top_size_cmp)(struct survey_report_object_size_summary *s1,
    --				   struct survey_report_object_size_summary *s2);
    -+typedef int (*survey_top_cmp)(void *v1,
    -+			      void *v2);
    + typedef int (*survey_top_cmp)(void *v1, void *v2);
      
     -MAYBE_UNUSED
    --static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    --		     struct survey_report_object_size_summary *s2)
    -+static int cmp_by_nr(void *v1, void *v2)
    + static int cmp_by_nr(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->nr < s2->nr)
    - 		return -1;
    - 	if (s1->nr > s2->nr)
    -@@ builtin/survey.c: static int cmp_by_nr(struct survey_report_object_size_summary *s1,
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: static int cmp_by_nr(void *v1, void *v2)
      	return 0;
      }
      
     -MAYBE_UNUSED
    --static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    --			    struct survey_report_object_size_summary *s2)
    -+static int cmp_by_disk_size(void *v1, void *v2)
    + static int cmp_by_disk_size(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->disk_size < s2->disk_size)
    - 		return -1;
    - 	if (s1->disk_size > s2->disk_size)
    -@@ builtin/survey.c: static int cmp_by_disk_size(struct survey_report_object_size_summary *s1,
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: static int cmp_by_disk_size(void *v1, void *v2)
      	return 0;
      }
      
     -MAYBE_UNUSED
    --static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    --				struct survey_report_object_size_summary *s2)
    -+static int cmp_by_inflated_size(void *v1, void *v2)
    + static int cmp_by_inflated_size(void *v1, void *v2)
      {
    -+	struct survey_report_object_size_summary *s1 = v1;
    -+	struct survey_report_object_size_summary *s2 = v2;
    -+
    - 	if (s1->inflated_size < s2->inflated_size)
    - 		return -1;
    - 	if (s1->inflated_size > s2->inflated_size)
    -@@ builtin/survey.c: static int cmp_by_inflated_size(struct survey_report_object_size_summary *s1,
    -  * inserting a new category, reorder the list and free the one that
    -  * got ejected (if any).
    -  */
    --struct survey_report_top_sizes {
    -+struct survey_report_top_table {
    - 	const char *name;
    --	survey_top_size_cmp cmp_fn;
    --	struct survey_report_object_size_summary *data;
    -+	survey_top_cmp cmp_fn;
    - 	size_t nr;
    - 	size_t alloc;
    -+
    -+	/**
    -+	 * 'data' stores an array of structs and must be cast into
    -+	 * the proper array type before evaluating an index.
    -+	 */
    -+	void *data;
    + 	struct survey_report_object_size_summary *s1 = v1;
    +@@ builtin/survey.c: struct survey_report_top_table {
    + 	void *data;
      };
      
     -MAYBE_UNUSED
    --static void init_top_sizes(struct survey_report_top_sizes *top,
    -+static void init_top_sizes(struct survey_report_top_table *top,
    + static void init_top_sizes(struct survey_report_top_table *top,
      			   size_t limit, const char *name,
    --			   survey_top_size_cmp cmp)
    -+			   survey_top_cmp cmp)
    - {
    -+	struct survey_report_object_size_summary *sz_array;
    -+
    - 	top->name = name;
    -+	top->cmp_fn = cmp;
    - 	top->alloc = limit;
    - 	top->nr = 0;
    --	CALLOC_ARRAY(top->data, limit);
    --	top->cmp_fn = cmp;
    -+
    -+	CALLOC_ARRAY(sz_array, limit);
    -+	top->data = sz_array;
    - }
    - 
    - MAYBE_UNUSED
    --static void clear_top_sizes(struct survey_report_top_sizes *top)
    -+static void clear_top_sizes(struct survey_report_top_table *top)
    - {
    -+	struct survey_report_object_size_summary *sz_array = top->data;
    - 	for (size_t i = 0; i < top->nr; i++)
    --		free(top->data[i].label);
    -+		free(sz_array[i].label);
    - 	free(top->data);
    + 			   survey_top_cmp cmp)
    +@@ builtin/survey.c: static void clear_top_sizes(struct survey_report_top_table *top)
    + 	free(sz_array);
      }
      
     -MAYBE_UNUSED
    --static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    -+static void maybe_insert_into_top_size(struct survey_report_top_table *top,
    + static void maybe_insert_into_top_size(struct survey_report_top_table *top,
      				       struct survey_report_object_size_summary *summary)
      {
    -+	struct survey_report_object_size_summary *sz_array = top->data;
    - 	size_t pos = top->nr;
    - 
    - 	/* Compare against list from the bottom. */
    --	while (pos > 0 && top->cmp_fn(&top->data[pos - 1], summary) < 0)
    -+	while (pos > 0 && top->cmp_fn(&sz_array[pos - 1], summary) < 0)
    - 		pos--;
    - 
    - 	/* Not big enough! */
    -@@ builtin/survey.c: static void maybe_insert_into_top_size(struct survey_report_top_sizes *top,
    - 
    - 	/* We need to shift the data. */
    - 	if (top->nr == top->alloc)
    --		free(top->data[top->nr - 1].label);
    -+		free(sz_array[top->nr - 1].label);
    - 	else
    - 		top->nr++;
    - 
    - 	for (size_t i = top->nr - 1; i > pos; i--)
    --		memcpy(&top->data[i], &top->data[i - 1], sizeof(*top->data));
    -+		memcpy(&sz_array[i], &sz_array[i - 1], sizeof(*sz_array));
    - 
    --	memcpy(&top->data[pos], summary, sizeof(*summary));
    --	top->data[pos].label = xstrdup(summary->label);
    -+	memcpy(&sz_array[pos], summary, sizeof(*summary));
    -+	sz_array[pos].label = xstrdup(summary->label);
    - }
    - 
    - /**
     @@ builtin/survey.c: struct survey_report {
      	struct survey_report_object_summary reachable_objects;
      
    @@ builtin/survey.c: static int survey_objects_path_walk_fn(const char *path,
     
      ## t/t8100-git-survey.sh ##
     @@ t/t8100-git-survey.sh: test_expect_success 'git survey (default)' '
    - 	       Tags |     4 |       510 |           528
    + 	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
      	EOF
      
     -	test_cmp expect out
 9:  dac3510837 =  9:  f9351d993a survey: add --top=<N> option and config
10:  0071e3c0c1 <  -:  ---------- fixup! survey: add object count summary
11:  4e69b52010 <  -:  ---------- fixup! survey: add command line opts to select references
12:  4e9fddcb96 <  -:  ---------- fixup! survey: summarize total sizes by object type
13:  1d11819bc3 <  -:  ---------- fixup! survey: summarize total sizes by object type
14:  97efdc3e02 <  -:  ---------- fixup! survey: add object count summary
15:  269b053d0a <  -:  ---------- fixup! survey: start pretty printing data in table form
16:  9378ca9664 ! 10:  bce8ef63c8 survey: clearly note the experimental nature in the output
    @@ builtin/survey.c: int cmd_survey(int argc, const char **argv, const char *prefix
      	if (argc == 2 && !strcmp(argv[1], "-h"))
      		usage_with_options(survey_usage, survey_options);
      
    -+	color_fprintf_ln(stderr,
    -+			 want_color_fd(2, GIT_COLOR_AUTO) ? GIT_COLOR_YELLOW : "",
    -+			 "(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)");
    ++	if (isatty(2))
    ++		color_fprintf_ln(stderr,
    ++				 want_color_fd(2, GIT_COLOR_AUTO) ? GIT_COLOR_YELLOW : "",
    ++				 "(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)");
     +
      	ctx.repo = the_repository;
      
17:  5d4ed3b475 <  -:  ---------- fixup! survey: clearly note the experimental nature in the output
18:  e42c3e6abb <  -:  ---------- fixup! survey: show progress during object walk

derrickstolee and others added 8 commits September 26, 2024 13:25
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concrete data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
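
As a rough illustration of the "check each column for the maximum width" idea in the commit message above, here is a minimal sketch in C. It is not the code from `builtin/survey.c`: the `demo_table` struct, the `DEMO_MAX_COLS` limit, and all `demo_*` function names are hypothetical, and the layout (right-aligned cells, ` | ` separators, a `=` rule under the title) is only modeled on the sample output shown in this PR.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_COLS 8 /* hypothetical fixed column limit for this sketch */

/* Hypothetical table: a title, a header row, and 'nr' rows of strings. */
struct demo_table {
	const char *title;
	size_t ncols;
	const char *header[DEMO_MAX_COLS];
	const char *(*rows)[DEMO_MAX_COLS];
	size_t nr;
};

/* Store the widest cell of each column (header included) in 'widths'. */
void demo_column_widths(const struct demo_table *t, size_t *widths)
{
	for (size_t c = 0; c < t->ncols; c++) {
		widths[c] = strlen(t->header[c]);
		for (size_t r = 0; r < t->nr; r++) {
			size_t len = strlen(t->rows[r][c]);
			if (len > widths[c])
				widths[c] = len;
		}
	}
}

/* Print one row, right-aligning every cell to its column width. */
void demo_print_row(FILE *out, const char *const *cells,
		    size_t ncols, const size_t *widths)
{
	for (size_t c = 0; c < ncols; c++)
		fprintf(out, "%s%*s", c ? " | " : "", (int)widths[c], cells[c]);
	fputc('\n', out);
}

/* Measure all columns first, then print title, header, rule, and rows. */
void demo_print_table(FILE *out, const struct demo_table *t)
{
	size_t widths[DEMO_MAX_COLS], total = 0, title_len = strlen(t->title);

	demo_column_widths(t, widths);
	for (size_t c = 0; c < t->ncols; c++)
		total += widths[c] + (c ? 3 : 0);
	if (total < title_len)
		total = title_len;

	fprintf(out, "%s\n", t->title);
	for (size_t i = 0; i < total; i++)
		fputc('=', out);
	fputc('\n', out);

	demo_print_row(out, t->header, t->ncols, widths);
	for (size_t c = 0; c < t->ncols; c++) {
		if (c)
			fputs("-+-", out);
		for (size_t i = 0; i < widths[c]; i++)
			fputc('-', out);
	}
	fputc('\n', out);

	for (size_t r = 0; r < t->nr; r++)
		demo_print_row(out, t->rows[r], t->ncols, widths);
}
```

Because every width is known before the first line is written, the table can be emitted in a single pass without re-aligning earlier rows.
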
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevalent in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Now that we have explored objects by count, we can expand that a bit more to
summarize the on-disk and inflated sizes of those objects. This information
helps diagnose both why disk space (and perhaps clone or fetch times) is
growing and why certain operations are slow: the inflated size of the
abstract objects that must be processed is so large.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
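
The constant-size "top contributors" table described above can be sketched as a bounded insertion into a sorted array. This is a simplified illustration, not the implementation in `builtin/survey.c`: the `demo_entry` and `demo_top` types and the `DEMO_TOP_MAX` limit are hypothetical, the ranking is hard-coded to one metric instead of a comparison callback, and path strings are borrowed, not copied.

```c
#include <stddef.h>

#define DEMO_TOP_MAX 16 /* hypothetical upper bound for this sketch */

struct demo_entry {
	const char *path;        /* borrowed pointer; the sketch never frees it */
	unsigned long disk_size; /* the metric used for ranking */
};

struct demo_top {
	struct demo_entry data[DEMO_TOP_MAX];
	size_t nr;    /* rows currently in use */
	size_t alloc; /* row limit, at most DEMO_TOP_MAX */
};

void demo_top_init(struct demo_top *top, size_t limit)
{
	top->nr = 0;
	top->alloc = limit < DEMO_TOP_MAX ? limit : DEMO_TOP_MAX;
}

/* Insert 'e' if it ranks within the top 'alloc' entries seen so far. */
void demo_top_insert(struct demo_top *top, const struct demo_entry *e)
{
	size_t pos = top->nr;

	/* Walk up from the bottom while the new entry outranks the stored row. */
	while (pos > 0 && top->data[pos - 1].disk_size < e->disk_size)
		pos--;

	/* The table is full and the entry ranks below every row: drop it. */
	if (pos >= top->alloc)
		return;

	/* Grow by one row, or let the last row fall off when already full. */
	if (top->nr < top->alloc)
		top->nr++;

	/* Shift lower-ranked rows down by one to open a slot at 'pos'. */
	for (size_t i = top->nr - 1; i > pos; i--)
		top->data[i] = top->data[i - 1];
	top->data[pos] = *e;
}
```

Each insertion costs at most `alloc` comparisons and moves, so memory stays bounded by the table size no matter how many paths the walk visits.
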
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the total
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact
output of 'git survey' and check that the correct set of headers is
output.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
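
To make the three metrics above concrete, the following sketch sums a batch of object versions for one path and offers one comparison function per metric. All names (`demo_object`, `demo_path_summary`, `demo_summarize_path`, `demo_cmp_by_*`) are hypothetical and the sizes are passed in directly. How the real builtin obtains sizes from the object database is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-object sizes, as a path-based walk might report them. */
struct demo_object {
	uint64_t disk_size;     /* bytes on disk (after delta/zlib) */
	uint64_t inflated_size; /* bytes once fully inflated */
};

/* Hypothetical per-path totals across all versions at that path. */
struct demo_path_summary {
	const char *path;
	uint64_t nr; /* number of versions */
	uint64_t disk_size;
	uint64_t inflated_size;
};

/* Sum one batch of objects that share the same path. */
void demo_summarize_path(const char *path, const struct demo_object *objs,
			 size_t nr, struct demo_path_summary *out)
{
	out->path = path;
	out->nr = nr;
	out->disk_size = 0;
	out->inflated_size = 0;
	for (size_t i = 0; i < nr; i++) {
		out->disk_size += objs[i].disk_size;
		out->inflated_size += objs[i].inflated_size;
	}
}

/* Return >0 if a ranks above b, <0 if below, 0 if tied. */
static int demo_cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

int demo_cmp_by_nr(const void *v1, const void *v2)
{
	const struct demo_path_summary *s1 = v1, *s2 = v2;
	return demo_cmp_u64(s1->nr, s2->nr);
}

int demo_cmp_by_disk_size(const void *v1, const void *v2)
{
	const struct demo_path_summary *s1 = v1, *s2 = v2;
	return demo_cmp_u64(s1->disk_size, s2->disk_size);
}

int demo_cmp_by_inflated_size(const void *v1, const void *v2)
{
	const struct demo_path_summary *s1 = v1, *s2 = v2;
	return demo_cmp_u64(s1->inflated_size, s2->inflated_size);
}
```

Keeping one comparison function per metric lets the same summary feed several "top" tables, since a path with many well-compressed versions can rank high by inflated size while ranking low by disk size.
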
The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 100,
currently.

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: To assist with diagnosing issues
with large repositories, as well as to help monitor the growth and
the associated painpoints of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familiar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

dscho commented Sep 26, 2024

/add relnote feature The new, experimental git survey command was added. This command is designed to help identify less-than-ideal data shape in monorepos, and it will likely see highly active development. Stay tuned!

The workflow run was started

github-actions bot pushed a commit to git-for-windows/build-extra that referenced this pull request Sep 26, 2024
The new, experimental [`git survey` command was
added](git-for-windows/git#5174). This command
is designed to help identify less-than-ideal data shape in monorepos,
and it will likely see highly active development. Stay tuned!

Signed-off-by: gitforwindowshelper[bot] <gitforwindowshelper-bot@users.noreply.github.com>
@dscho dscho merged commit 68f029a into git-for-windows:main Sep 26, 2024
44 checks passed
git-for-windows-ci pushed a commit that referenced this pull request Sep 26, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 26, 2024
dscho added a commit that referenced this pull request Sep 27, 2024