
Subgraph support for query batching #4661

Merged · 87 commits · Apr 16, 2024

Conversation

@garypen (Contributor) commented Feb 15, 2024

This project is an extension of the existing work to support client-side batching in the router.
The current implementation is experimental and is publicly documented.
The additional work to enable batching requests to subgraphs is captured in this issue.
Currently the concept of a batch is preserved until the end of RouterRequest processing. At that point we convert each batch request item into a separate SupergraphRequest. These are then planned and executed concurrently within the router and re-assembled into a batch when they complete. Note that, in this implementation, the concept of a batch disappears from the perspective of an executing router: each request is planned and executed separately.
This extension will modify the router so that the concept of a batch is preserved, at least outwardly, so that multiple subgraph requests are "batched" (in exactly the same format as a client batch request; an example payload is shown below) for onward transmission to subgraphs. The goal of this work is an optimisation: reducing the number of round-trips from the router to a subgraph.
Additionally, the work will address an unresolved issue from the existing experimental implementation and promote that implementation from experimental to fully supported.
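For illustration, a client batch is a JSON array of ordinary GraphQL requests, and the subgraph batches introduced here reuse that shape (the queries below are made up for the example):

```json
[
  { "query": "query A { me { name } }" },
  { "query": "query B { topProducts { upc } }" }
]
```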

Fixes #2002


Review Guidance

This is a fairly big PR, so I've written these notes to help make the review more approachable.

  1. The most important files to review are (in order):

First read the documentation. Hopefully that will make clear how this feature works. I've picked these files as being most important (and ordered them for review) because:

router service => This is where we spot incoming batches and create context BatchQuery items to manage them through the router. We also re-assemble them on the way back and identify any batches which may need to be cancelled.

supergraph service => Here we pick up the information about how many fetches we believe each BatchQuery will need to make.

plan => The new query_hashes() does this fetch identification for us. This is the most important function in this feature.

subgraph service => Here is where we intercept the calls to subgraphs and park tasks to wait for batch execution to be performed. We do a lot of work here, so this is where most of the intrusive changes are: assembling and disassembling batches and managing the co-ordination between a number of parked tasks.

batching => This is the implementation of batch co-ordination. Each batch has a task which manages a variety of channels to facilitate communication between the incoming batches, waiting tasks and outgoing (to subgraph) batches. I'm suggesting reading this after the service changes because it is mainly implementation detail, and you will be able to follow what is happening without knowing it all initially. Once you understand the changes to the services, you will need to read this code. Feel free to peek ahead, though, if that's how you like to review. (A simplified sketch of this channel-based co-ordination appears after these notes.)

  2. There are still a couple of TODOs which will be resolved early next week. They are both related to how we handle context cloning, so a decision is still pending there.

Obviously all the files need to be reviewed, but the remaining files should be fairly mechanical/straightforward.
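As promised above, here is a minimal, self-contained sketch of the channel-based co-ordination pattern. It uses tokio primitives and stand-in String types; it illustrates the pattern, not the router's actual batching.rs code:

```rust
use tokio::sync::{mpsc, oneshot};

// One message per subgraph fetch that joins the batch.
struct BatchMessage {
    request: String,                  // placeholder for a real subgraph request
    respond: oneshot::Sender<String>, // wakes the parked task with its response
}

// Collects `expected` fetches, "executes" the assembled batch, then fans
// the responses back out to the parked tasks.
async fn batch_coordinator(mut rx: mpsc::Receiver<BatchMessage>, expected: usize) {
    let mut waiters = Vec::with_capacity(expected);
    while waiters.len() < expected {
        match rx.recv().await {
            Some(msg) => waiters.push(msg),
            None => return, // all senders dropped: the batch was cancelled
        }
    }
    // In the real router this is where one batched HTTP request would be
    // assembled, sent to the subgraph, and split back into responses.
    for (i, waiter) in waiters.into_iter().enumerate() {
        let _ = waiter.respond.send(format!("response {i} to {}", waiter.request));
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    tokio::spawn(batch_coordinator(rx, 2));

    // Two "subgraph service" calls park on their oneshot receivers until
    // the coordinator has executed the batch.
    let mut handles = Vec::new();
    for name in ["query A", "query B"] {
        let tx = tx.clone();
        handles.push(tokio::spawn(async move {
            let (respond, wait) = oneshot::channel();
            tx.send(BatchMessage { request: name.into(), respond })
                .await
                .expect("coordinator is alive");
            println!("{}", wait.await.expect("coordinator always responds"));
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```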


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Tests added and passing³
    • Unit Tests
    • Integration Tests
    • [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

This version isn't working: it hangs during batch assembly. If I take
the `receiver.await` out, then batch assembly does proceed, so my guess
is that something further up the pipeline is stopping execution from
proceeding.
As long as no elements of your query have fetches that "require"
results, your batch query will work.

I still have more work to do to complete the "requires" components

also:
 - There's a bunch of really horrible byte soup that needs to be cleaned
   up, along with an absolute ton of refactoring and functional
   decomposition.
Once we've processed all the "no requires" fetches, we can no longer
create batches from the remaining fetches. I get the same hanging-waiter
problem that I encountered earlier in prototyping.

For now, I've implemented a solution which creates one set of batches
and then forces remaining queries to execute independently. Hopefully
that will be enough to satisfy the requirements of the feature.
I don't need two separate functions for counting fetches. Just provide a
parameter.
I forgot that I had to fix up the subgraph_fetches function with my
check on `include_requires`; a sketch of the resulting shape follows.
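A rough sketch of that parameterised counter. The PlanNode shape is a deliberate simplification, not the router's real query-plan type:

```rust
// Simplified stand-in for a query-plan node.
enum PlanNode {
    Fetch { requires: bool },
    Sequence(Vec<PlanNode>),
    Parallel(Vec<PlanNode>),
}

// One counting function with a flag, instead of two near-duplicates.
fn subgraph_fetches(node: &PlanNode, include_requires: bool) -> usize {
    match node {
        // Count a fetch with a "requires" dependency only when asked to.
        PlanNode::Fetch { requires } => usize::from(include_requires || !requires),
        PlanNode::Sequence(children) | PlanNode::Parallel(children) => children
            .iter()
            .map(|child| subgraph_fetches(child, include_requires))
            .sum(),
    }
}
```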
Add some content to the existing pages for early review by docs.
@garypen self-assigned this Feb 15, 2024


@router-perf bot commented Feb 15, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

Outlines new capabilities and adds illustrative configuration.
Isolate the batching support into its own module and start to reduce
code duplication.
BatchDetails -> BatchQuery
SharedBatchDetails -> Batch

These names better reflect their nature.
Replace the 4-element tuple with a struct which captures all the
relevant data for waiting for a request.

At some point the impl for a Waiter will have functions that cover
formatting and transforming from different representations, but not
yet...
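Purely as an illustration of the tuple-to-struct change (every field name below is a guess, not the router's actual definition):

```rust
use tokio::sync::oneshot;

// Before: an anonymous 4-element tuple, addressed by position.
// After: named fields that say what each element is for.
struct Waiter {
    context: String,                  // placeholder for the router Context
    request: String,                  // placeholder for the subgraph request
    operation_name: Option<String>,   // placeholder for the GraphQL operation
    respond: oneshot::Sender<String>, // channel that wakes the parked task
}
```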
Rename some things to make it more obvious what's going on. Get ready to
add some utility functions to batching.rs
To move some of the code into a function.
Start trying to break down the megafunction that is call_http().
We can't leave waiters in a Batch, since that would mean a request had
been "lost", with all the negative implications of that for the client.

So, until I figure out the details, impl Drop for Batch and panic if
self.waiters is not empty.
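That guard, sketched minimally (the real Batch holds much more state):

```rust
// Placeholder for the per-request state a batch holds.
struct Waiter;

struct Batch {
    waiters: Vec<Waiter>,
}

impl Drop for Batch {
    fn drop(&mut self) {
        // A waiter left behind in a dropped Batch is a client request that
        // would never receive a response, so fail loudly rather than lose it.
        assert!(
            self.waiters.is_empty(),
            "batch dropped with {} waiter(s) still pending",
            self.waiters.len()
        );
    }
}
```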
We need a new error to represent things that can go wrong when we are
manipulating batches.

This new type of FetchError does the job for us. I've also cleaned up
some of the interactions with other functions, and tracing.
Extract out the http_response_into_graphql_response() function and
improve the building of arrays of batch responses.

Also:
 - add checks/invariants for array length comparison
 - indicate when code is unreachable
 - add some comments
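The length invariant reads roughly like this (an assumed example, not the PR's code):

```rust
// A batch response must contain exactly one entry per request; anything
// else means the subgraph misbehaved, so fail fast with a clear message.
fn check_batch_sizes(requests: &[String], responses: &[String]) {
    assert_eq!(
        requests.len(),
        responses.len(),
        "subgraph batch returned {} entries for {} requests",
        responses.len(),
        requests.len()
    );
}
```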
Add the optional "subgraph" attribute to capture batching metrics.
We need to add support for subgraph filtering to the subgraph service,
but we can't do that until the configuration work is completed.

Add a comment so we don't forget about it.
garypen and others added 6 commits April 12, 2024 10:15
This might be reverted in a future validation compatibility PR, but for
now update the snapshot to match the current validation behaviour.
Correct link.

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>
@o0Ignition0o (Contributor) left a comment:
Couple of nits, nothing that prevents a merge, lgtm.

apollo-router/src/services/subgraph_service.rs (outdated, resolved)
apollo-router/src/services/router/service.rs (outdated, resolved)
@garypen requested a review from Geal April 16, 2024 10:27
@garypen enabled auto-merge (squash) April 16, 2024 10:32
@garypen disabled auto-merge April 16, 2024 11:01
Development

Successfully merging this pull request may close these issues:

Allow batched requests to subgraphs

10 participants