
Switch torngit.github.list_repos() to graphql, measure performance difference #139

Closed
matt-codecov opened this issue Jul 31, 2023 · 1 comment

matt-codecov commented Jul 31, 2023

torngit.github.list_repos() uses GitHub's REST API to enumerate the repos visible to a given user (or the currently signed-in one). The REST API responds with pages of 100 repos, each with all of its details filled out. If we convert these requests to GitHub's GraphQL API, we can request only the fields we need and may see a performance boost.
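For reference, the current behavior amounts to walking the REST endpoint one page at a time. This is a minimal sketch, not the actual torngit code: the fetch_page callable is injected so the pagination logic can be exercised without the network, and it would normally wrap something like GET /user/repos?per_page=100&page=N.

```python
# Sketch of sequential REST pagination (the current approach).
# fetch_page is a stand-in for an HTTP call to GitHub's REST API.
from typing import Callable, Dict, List

PAGE_SIZE = 100

def list_repos_rest(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Fetch pages of repos in order until a short page signals the end."""
    repos: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page)
        repos.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # a short (or empty) page means there are no more pages
        page += 1
    return repos
```

Each request here depends on nothing from the previous response except the page counter, which is what makes the concurrent variant discussed below possible.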

GitHub has a GraphQL Explorer app that you can use to draft your queries: https://docs.github.com/en/graphql/overview/explorer
An example query would look something like:

query {
  viewer {
    login
    id
    repositories(
      ownerAffiliations: [OWNER, COLLABORATOR, ORGANIZATION_MEMBER]
      first: 10
      after: "Y3Vyc29yOnYyOpHOAXbfGQ=="
    ) {
      pageInfo {
        endCursor
        hasNextPage
      }
      edges {
        cursor
        node {
          owner {
            login
            id
          }
          id
          name
        }
      }
    }
  }
}

first is the page size, and after is the cursor marking the end of the previous page. pageInfo gives you endCursor, the cursor at the end of the current page, and hasNextPage, which is what it says on the tin.
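The cursor loop can be sketched as follows. This is an illustration, not torngit's implementation: run_query is injected (it would normally POST the query above to https://api.github.com/graphql with after set to the cursor), so only the pagination logic is shown.

```python
# Sketch of cursor-based pagination over the GraphQL query above.
# run_query(cursor) is a stand-in for executing the query with `after: cursor`.
from typing import Callable, Dict, List, Optional

def list_repos_graphql(run_query: Callable[[Optional[str]], Dict]) -> List[Dict]:
    """Follow endCursor from page to page until hasNextPage is false."""
    repos: List[Dict] = []
    cursor: Optional[str] = None  # None means "start from the first page"
    while True:
        result = run_query(cursor)
        conn = result["viewer"]["repositories"]
        repos.extend(edge["node"] for edge in conn["edges"])
        if not conn["pageInfo"]["hasNextPage"]:
            break
        cursor = conn["pageInfo"]["endCursor"]  # next page starts after this
    return repos
```

Note that each iteration needs the previous response's endCursor before it can issue the next query, so the pages are inherently sequential.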

The ownerAffiliations argument tells GitHub which repos to include. We want to test and understand any differences between what this returns and what the REST API returns.

Hopefully, this will be a significant performance improvement for users that, between all their orgs, are syncing hundreds of repos.
There's a timer wrapping the call to list_repos() in sync_repos, which is viewable on the Elastic dashboard or in Grafana under the name worker_SyncReposTask_sync_repos_list_repos. Watch this metric after deploying to see whether the median, P75, or P95 gets lower or flatter. If so, look for more opportunities!

If networking wasn't the bottleneck here, or if the bloated REST API responses compress extremely well, we won't see much of an impact. But it's worth investigating!

matt-codecov (Author) commented

https://gist.github.com/matt-codecov/6c6b9a3e652ad41103d2557761eafac0

I compared three approaches using a token for my account:

  • Current behavior
  • Pre-fetching the total, then downloading pages concurrently
  • Current behavior, but using the GraphQL API

The GraphQL API performed about as well as the current behavior with 140 repos to fetch (I tried page sizes of 10 and 100), but the pre-fetching approach, which can't be done with the GraphQL API, performed significantly better.

The REST API uses numbered pages, so if you know how many repos you have to fetch, you can precompute the URL for each page and download them all concurrently. The GraphQL API uses "cursors" for pagination, and with that approach each page's query depends on data from the previous page (its endCursor field), so the pages can't be downloaded concurrently.

Closing this issue; the prefetch/concurrent approach will probably be done in codecov/shared#24 when I'm back from vacation.
