
Switch torngit.github.list_repos() to graphql, measure performance difference #139

Closed
matt-codecov opened this issue Jul 31, 2023 · 1 comment

matt-codecov commented Jul 31, 2023

torngit.github.list_repos() uses GitHub's REST API to enumerate the repos visible to a given user (or the currently signed-in one). The REST API responds with pages of 100 repos, each with all of its details filled out. If we convert these requests to GitHub's GraphQL API, we can request only the fields we need and may see a performance boost.
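For reference, the current behavior amounts to walking the REST endpoint one page at a time. This is a minimal sketch, not the actual torngit code: the fetch_page callable is injected so the pagination logic can be exercised without the network, and it would normally wrap something like GET /user/repos?per_page=100&page=N.

```python
# Sketch of sequential REST pagination (the current approach).
# fetch_page is a stand-in for an HTTP call to GitHub's REST API.
from typing import Callable, Dict, List

PAGE_SIZE = 100

def list_repos_rest(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Fetch pages of repos in order until a short page signals the end."""
    repos: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page)
        repos.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # a short (or empty) page means there are no more pages
        page += 1
    return repos
```

Each request here depends on nothing from the previous response except the page counter, which is what makes the concurrent variant discussed below possible.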

GitHub has a GraphQL Explorer app that you can use to draft your queries: https://docs.github.com/en/graphql/overview/explorer
An example query would look something like:

query {
  viewer {
    login
    id
    repositories(
      ownerAffiliations: [OWNER, COLLABORATOR, ORGANIZATION_MEMBER]
      first: 10
      after: "Y3Vyc29yOnYyOpHOAXbfGQ=="
    ) {
      pageInfo {
        endCursor
        hasNextPage
      }
      edges {
        cursor
        node {
          owner {
            login
            id
          }
          id
          name
        }
      }
    }
  }
}

first is the page size, and after is the cursor marking the end of the previous page. pageInfo gives you endCursor, the cursor at the end of the current page, and hasNextPage, which is what it says on the tin.
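The cursor loop can be sketched as follows. This is an illustration, not torngit's implementation: run_query is injected (it would normally POST the query above to https://api.github.com/graphql with after set to the cursor), so only the pagination logic is shown.

```python
# Sketch of cursor-based pagination over the GraphQL query above.
# run_query(cursor) is a stand-in for executing the query with `after: cursor`.
from typing import Callable, Dict, List, Optional

def list_repos_graphql(run_query: Callable[[Optional[str]], Dict]) -> List[Dict]:
    """Follow endCursor from page to page until hasNextPage is false."""
    repos: List[Dict] = []
    cursor: Optional[str] = None  # None means "start from the first page"
    while True:
        result = run_query(cursor)
        conn = result["viewer"]["repositories"]
        repos.extend(edge["node"] for edge in conn["edges"])
        if not conn["pageInfo"]["hasNextPage"]:
            break
        cursor = conn["pageInfo"]["endCursor"]  # next page starts after this
    return repos
```

Note that each iteration needs the previous response's endCursor before it can issue the next query, so the pages are inherently sequential.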

The ownerAffiliations argument tells GitHub which repos to include. We want to test and understand any differences between what this returns and what the REST API returns.

Hopefully, this will be a significant performance improvement for users that, between all their orgs, are syncing hundreds of repos.
There's a timer wrapping the call to list_repos() in sync_repos, which is viewable on the Elastic dashboard or in Grafana under the name worker_SyncReposTask_sync_repos_list_repos. Watch this metric after deploying to see whether the median, P75, or P95 gets lower or flatter. If so, look for more opportunities!

If networking wasn't the bottleneck here, or if the bloated REST API responses compress extremely well, we won't see much of an impact. But it's worth investigating!

matt-codecov (Author) commented

https://gist.github.com/matt-codecov/6c6b9a3e652ad41103d2557761eafac0

I compared three approaches using a token for my account:

  • Current behavior
  • Pre-fetching the total, then downloading pages concurrently
  • Current behavior, but using the GraphQL API

The GraphQL API performed about as well as the current behavior with 140 repos to fetch (I tried page sizes of 10 and 100), but the pre-fetching approach, which can't be done with the GraphQL API, performed significantly better.

The REST API uses numbered pages, so if you know how many repos you have to fetch, you can precompute the URL for each page and download them all concurrently. The GraphQL API uses "cursors" for pagination, and with that approach each page's query depends on data from the previous page (its endCursor field), so the pages can't be downloaded concurrently.

Closing this issue; the prefetch/concurrent approach will probably be done in codecov/shared#24 when I'm back from vacation.
