
[wip] convert list_repos to a generator that yields pages #24

Closed

Conversation


matt-codecov commented Aug 15, 2023

[UNFINISHED] we apparently need to dedup results for gitlab. i started it but am calling it for the night

codecov/worker#62
codecov/engineering-team#6
codecov/engineering-team#101

per adrian, new customers who have lots of repos across their orgs have an unpleasant first sync: the ui shows nothing for potentially minutes until finally all the repos explode into view. instead, he wants the ui to continually refetch the currently-showing set of repos. so initially the user sees 0, and then 13, and then 40, and then 50 + a "load more" button, and so on.

currently we fetch page after page of repos from the git service provider, flatten the list, and return it. we then iterate through the complete set of repos and make db updates. fetching is slow, and frontloading all of it before making any db updates undermines adrian's ui tweak; if we instead alternate between fetching a page and updating the db, the initial sync ux should feel more responsive.

so, this PR converts list_repos() and list_repos_using_integration() to async generators that yield one page at a time. a corresponding change is needed in worker to update its usages of these functions from:

repos = git.list_repos()
for repo in repos:
    ...

to:

async for page in git.list_repos():
    for repo in page:
        ...
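
for reference, the producing side might look roughly like the sketch below. this is a minimal sketch, not the actual implementation: self.api() and self.page_size are hypothetical stand-ins for the real provider-specific request helper and page size.

async def list_repos(self):
    # fetch numbered pages one at a time and yield each page to the caller
    # instead of flattening everything into one big list first
    page = 1
    while True:
        repos = await self.api(f'/user/repositories?page={page}&per_page={self.page_size}')
        if not repos:
            return
        yield repos  # caller consumes this with `async for page in git.list_repos()`
        if len(repos) < self.page_size:
            return
        page += 1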

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


matt-codecov commented Aug 24, 2023

quick thought: a cheap way to address the gitlab dedup issue is to run the function unmodified and then yield the full set of repos as a single "page". basically opting gitlab out of the more responsive UX. there might be a better solution, just writing this down so i remember after i return from vacation
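
roughly, that fallback could look like the sketch below; _fetch_all_repos() is a hypothetical stand-in for the existing fetch-everything logic, and the repo dict shape / service_id key is an assumption:

async def list_repos(self):
    # gitlab fallback sketch: reuse the current fetch-everything behavior,
    # dedup the results, and yield them all as a single "page"
    all_repos = await self._fetch_all_repos()  # hypothetical existing helper
    seen = set()
    deduped = []
    for repo in all_repos:
        key = repo["repo"]["service_id"]  # assumed unique id in the repo dict
        if key not in seen:
            seen.add(key)
            deduped.append(repo)
    yield deduped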

bigger thought: this PR would take the same net time as the current behavior, it just rearranges the work so the UX will be more responsive. but actually i think we can also decrease the total time required at the cost of one additional API request to get the total number of viewable repositories. github's graphql api exposes this, and if other providers don't then they can just use a normal generator as in this PR

for github, this would look something like:

query = """
query {
  viewer {
    repositories(ownerAffiliations: [OWNER, COLLABORATOR, ORGANIZATION_MEMBER]) {
      totalCount
    }
  }
}
"""
# the response shape here is assumed; the real api() helper may return something different
res = await api('/graphql', data=query)
repos_count = res['data']['viewer']['repositories']['totalCount']

# ceiling division: one full page per page_size repos, plus a partial page if needed
pages = repos_count // page_size
if repos_count % page_size > 0:
    pages += 1

# kick off every page fetch up front; asyncio.as_completed schedules them all
# concurrently and hands back each page as it finishes
futures = [api(f'/user/repositories?page={page}') for page in range(1, pages + 1)]
for future in asyncio.as_completed(futures):
    yield await future

the current approach will not continue fetching until the caller requests another page. this new approach would continue fetching after yielding each page to the caller, and it also can fetch pages concurrently. i've seen syncs with as many as 100 pages and such cases should be a good deal faster using this logic.

some callouts:

  • big assumption, which should be vetted, is that the GH repositories GraphQL query returns the same set of repositories as the REST API
  • have not checked whether this
  • two new edge cases where repos are created/deleted at page boundaries while fetching (see the defensive sketch after this list)
    • if we query the count and get 101, and then a repo is deleted while we're fetching, the last page will be empty or error
    • if we query the count and get 100, and then a repo is created while we're fetching, we'll fetch one too few pages
  • this is incompatible with engineering-team#139 ("Switch torngit.github.list_repos() to graphql, measure performance difference"). the GH REST API uses numbered pages, which lets us precompute all the URLs we need to hit; the GH GraphQL API uses cursor-based pagination, where fetching each page requires information from the previous page, and there's no way to predict all the cursors upfront
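
a defensive sketch for those two races (not part of this PR; same hypothetical api() helper as above, and the provider-specific error type is glossed over as a bare Exception):

for future in asyncio.as_completed(futures):
    try:
        page_repos = await future
    except Exception:
        # a repo deleted mid-sync can leave the final precomputed page empty
        # or erroring; treat that as "no more repos" instead of failing the sync
        continue
    if page_repos:
        yield page_repos

# a repo created mid-sync can push results past the precomputed page count;
# keep fetching sequentially until an empty page shows up
next_page = pages + 1
while True:
    extra = await api(f'/user/repositories?page={next_page}')
    if not extra:
        break
    yield extra
    next_page += 1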

@matt-codecov

https://gist.github.com/matt-codecov/6c6b9a3e652ad41103d2557761eafac0

i compared three approaches using a token for my account

  • current behavior
  • pre-fetching the total + concurrently downloading
  • current behavior but using the graphql api

results:

  • with page_size=10 and 140 repos to fetch, the prefetching approach is notably faster than the current behavior, and graphql is about the same speed as the current behavior
  • with page_size=100 (the prod value) and 140 repos to fetch, the prefetching approach is still faster, and graphql is a little faster than the current behavior too, but not as fast as prefetching

all three return the same set of repos, except strangely the graphql api picks up one additional repo (https://github.com/aj-codecov/codecov-client). no idea why

from a quick glance, couldn't find a way to prefetch the repo count on bitbucket or gitlab. so this'd be a github-only optimization and the other two would just use generators as done here

@matt-codecov

replaced by #46

matt-codecov deleted the matt/list-repos-generator branch September 22, 2023 21:41