
[wip] convert list_repos to a generator that yields pages #24

Closed

Conversation


matt-codecov commented Aug 15, 2023

[UNFINISHED] we apparently need to dedup results for gitlab. i started it but am calling it for the night

codecov/worker#62
codecov/engineering-team#6
codecov/engineering-team#101

per adrian, new customers who have lots of repos across their orgs have an unpleasant first sync: the ui shows nothing for potentially minutes until finally all the repos explode into view. instead, he wants the ui to continually refetch the currently-showing set of repos. so initially the user sees 0, and then 13, and then 40, and then 50 + a "load more" button, and so on.

currently we fetch page after page of repos from the git service provider, flatten the list, and return it. we then iterate through the complete set of repos and make db updates. fetching is slow, and frontloading all of it before making any db updates undermines adrian's ui tweak; if we instead alternate between fetching a page and updating the db, the initial sync ux should feel more responsive.

so, this PR converts list_repos() and list_repos_using_integration() to async generators that yield one page at a time. a corresponding change is needed in worker to update its usages of these functions from:

repos = git.list_repos()
for repo in repos:
    ...

to:

async for page in git.list_repos():
    for repo in page:
        ...
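
for reference, the producing side might look roughly like the sketch below. this is a minimal sketch, not the actual implementation: self.api() and self.page_size are hypothetical stand-ins for the real provider-specific request helper and page size.

async def list_repos(self):
    # fetch numbered pages one at a time and yield each page to the caller
    # instead of flattening everything into one big list first
    page = 1
    while True:
        repos = await self.api(f'/user/repositories?page={page}&per_page={self.page_size}')
        if not repos:
            return
        yield repos  # caller consumes this with `async for page in git.list_repos()`
        if len(repos) < self.page_size:
            return
        page += 1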

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


matt-codecov commented Aug 24, 2023

quick thought: a cheap way to address the gitlab dedup issue is to run the function unmodified and then yield the full set of repos as a single "page". basically opting gitlab out of the more responsive UX. there might be a better solution, just writing this down so i remember after i return from vacation
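
roughly, that fallback could look like the sketch below; _fetch_all_repos() is a hypothetical stand-in for the existing fetch-everything logic, and the repo dict shape / service_id key is an assumption:

async def list_repos(self):
    # gitlab fallback sketch: reuse the current fetch-everything behavior,
    # dedup the results, and yield them all as a single "page"
    all_repos = await self._fetch_all_repos()  # hypothetical existing helper
    seen = set()
    deduped = []
    for repo in all_repos:
        key = repo["repo"]["service_id"]  # assumed unique id in the repo dict
        if key not in seen:
            seen.add(key)
            deduped.append(repo)
    yield deduped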

bigger thought: this PR would take the same net time as the current behavior, it just rearranges the work so the UX will be more responsive. but actually i think we can also decrease the total time required at the cost of one additional API request to get the total number of viewable repositories. github's graphql api exposes this, and if other providers don't then they can just use a normal generator as in this PR

for github, this would look something like:

query = """
query {
  viewer {
    repositories(ownerAffiliations: [OWNER, COLLABORATOR, ORGANIZATION_MEMBER]) {
      totalCount
    }
  }
}
"""
# the response shape here is assumed; the real api() helper may return something different
res = await api('/graphql', data=query)
repos_count = res['data']['viewer']['repositories']['totalCount']

# ceiling division: one full page per page_size repos, plus a partial page if needed
pages = repos_count // page_size
if repos_count % page_size > 0:
    pages += 1

# kick off every page fetch up front; asyncio.as_completed schedules them all
# concurrently and hands back each page as it finishes
futures = [api(f'/user/repositories?page={page}') for page in range(1, pages + 1)]
for future in asyncio.as_completed(futures):
    yield await future

the current approach will not continue fetching until the caller requests another page. this new approach would continue fetching after yielding each page to the caller, and it also can fetch pages concurrently. i've seen syncs with as many as 100 pages and such cases should be a good deal faster using this logic.

some callouts:

  • big assumption, which should be vetted, is that the GH repositories GraphQL query returns the same set of repositories as the REST API
  • have not checked whether this
  • two new edge cases where repos are created/deleted at page boundaries while fetching (see the defensive sketch after this list)
    • if we query the count and get 101, and then a repo is deleted while we're fetching, the last page will be empty or error
    • if we query the count and get 100, and then a repo is created while we're fetching, we'll fetch one too few pages
  • this is incompatible with engineering-team#139 ("Switch torngit.github.list_repos() to graphql, measure performance difference"). the GH REST API uses numbered pages, which lets us precompute all the URLs we need to hit; the GH GraphQL API uses cursor-based pagination, where fetching each page requires information from the previous page, and there's no way to predict all the cursors upfront
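
a defensive sketch for those two races (not part of this PR; same hypothetical api() helper as above, and the provider-specific error type is glossed over as a bare Exception):

for future in asyncio.as_completed(futures):
    try:
        page_repos = await future
    except Exception:
        # a repo deleted mid-sync can leave the final precomputed page empty
        # or erroring; treat that as "no more repos" instead of failing the sync
        continue
    if page_repos:
        yield page_repos

# a repo created mid-sync can push results past the precomputed page count;
# keep fetching sequentially until an empty page shows up
next_page = pages + 1
while True:
    extra = await api(f'/user/repositories?page={next_page}')
    if not extra:
        break
    yield extra
    next_page += 1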

@matt-codecov

https://gist.github.com/matt-codecov/6c6b9a3e652ad41103d2557761eafac0

i compared three approaches using a token for my account

  • current behavior
  • pre-fetching the total + concurrently downloading
  • current behavior but using the graphql api

results:

  • with page_size=10 and 140 repos to fetch, the prefetching approach is notably faster than the current behavior, and graphql is about the same speed as the current behavior
  • with page_size=100 (the prod value) and 140 repos to fetch, the prefetching approach is still faster, and graphql is a little faster than the current behavior too, but not as fast as prefetching

all three return the same set of repos, except strangely the graphql api picks up one additional repo (https://github.com/aj-codecov/codecov-client). no idea why

from a quick glance, couldn't find a way to prefetch the repo count on bitbucket or gitlab. so this'd be a github-only optimization and the other two would just use generators as done here

@matt-codecov

replaced by #46

matt-codecov deleted the matt/list-repos-generator branch September 22, 2023 21:41