
Subgraph support for query batching #4661

Merged · 87 commits · Apr 16, 2024

Conversation

@garypen (Contributor) commented Feb 15, 2024

This project is an extension of the existing work to support client-side batching in the router.
The current implementation is experimental and is publicly documented.
The additional work to enable batching requests to subgraphs is captured in this issue.
Currently the concept of a batch is preserved until the end of RouterRequest processing. At that point we convert each batch request item into a separate SupergraphRequest. These are then planned and executed concurrently within the router and re-assembled into a batch when they complete. Note that, in this implementation, the concept of a batch disappears from the perspective of an executing router: each request is planned and executed separately.
This extension will modify the router so that the concept of a batch is preserved, at least outwardly, so that multiple subgraph requests are "batched" (in exactly the same format as a client batch request; an example payload is shown below) for onward transmission to subgraphs. The goal of this work is an optimisation: reducing the number of round-trips from the router to a subgraph.
Additionally, the work will address an unresolved issue from the existing experimental implementation and promote that implementation from experimental to fully supported.
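For illustration, a client batch is a JSON array of ordinary GraphQL requests, and the subgraph batches introduced here reuse that shape (the queries below are made up for the example):

```json
[
  { "query": "query A { me { name } }" },
  { "query": "query B { topProducts { upc } }" }
]
```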

Fixes #2002


Review Guidance

This is a fairly big PR, so I've written these notes to help make the review more approachable.

  1. The most important files to review are (in order):

First read the documentation. Hopefully that will make clear how this feature works. I've picked these files as being most important (and ordered them for review) because:

router service => This is where we spot incoming batches and create context BatchQuery items to manage them through the router. We also re-assemble them on the way back and identify any batches which may need to be cancelled.

supergraph service => Here we pick up the information about how many fetches we believe each BatchQuery will need to make.

plan => The new query_hashes() does this fetch identification for us. This is the most important function in this feature.

subgraph service => Here is where we intercept the calls to subgraphs and park tasks to wait for batch execution to be performed. We do a lot of work here, so this is where most of the intrusive changes are: assembling and disassembling batches and managing the co-ordination between a number of parked tasks.

batching => This is the implementation of batch co-ordination. Each batch has a task which manages a variety of channels to facilitate communication between the incoming batches, waiting tasks and outgoing (to subgraph) batches. I'm suggesting reading this after the service changes because it is mainly implementation detail, and you will be able to follow what is happening without knowing it all initially. Once you understand the changes to the services, you will need to read this code. Feel free to peek ahead, though, if that's how you like to review. (A simplified sketch of this channel-based co-ordination appears after these notes.)

  2. There are still a couple of TODOs which will be resolved early next week. They are both related to how we handle context cloning, so a decision is still pending there.

Obviously all the files need to be reviewed, but the remaining files should be fairly mechanical/straightforward.
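As promised above, here is a minimal, self-contained sketch of the channel-based co-ordination pattern. It uses tokio primitives and stand-in String types; it illustrates the pattern, not the router's actual batching.rs code:

```rust
use tokio::sync::{mpsc, oneshot};

// One message per subgraph fetch that joins the batch.
struct BatchMessage {
    request: String,                  // placeholder for a real subgraph request
    respond: oneshot::Sender<String>, // wakes the parked task with its response
}

// Collects `expected` fetches, "executes" the assembled batch, then fans
// the responses back out to the parked tasks.
async fn batch_coordinator(mut rx: mpsc::Receiver<BatchMessage>, expected: usize) {
    let mut waiters = Vec::with_capacity(expected);
    while waiters.len() < expected {
        match rx.recv().await {
            Some(msg) => waiters.push(msg),
            None => return, // all senders dropped: the batch was cancelled
        }
    }
    // In the real router this is where one batched HTTP request would be
    // assembled, sent to the subgraph, and split back into responses.
    for (i, waiter) in waiters.into_iter().enumerate() {
        let _ = waiter.respond.send(format!("response {i} to {}", waiter.request));
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    tokio::spawn(batch_coordinator(rx, 2));

    // Two "subgraph service" calls park on their oneshot receivers until
    // the coordinator has executed the batch.
    let mut handles = Vec::new();
    for name in ["query A", "query B"] {
        let tx = tx.clone();
        handles.push(tokio::spawn(async move {
            let (respond, wait) = oneshot::channel();
            tx.send(BatchMessage { request: name.into(), respond })
                .await
                .expect("coordinator is alive");
            println!("{}", wait.await.expect("coordinator always responds"));
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```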


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Tests added and passing³
    • Unit Tests
    • Integration Tests
    • [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

This version isn't working: it hangs during batch assembly. If I take
the `receiver.await` out, then batch assembly does proceed, so my guess
is that something further up the pipeline is stopping execution from
proceeding.
As long as no elements of your query have fetches that "require"
results, your batch query will work.

I still have more work to do to complete the "requires" components

also:
 - There's a bunch of really horrible byte soup that needs to be cleaned
   up, along with an absolute ton of refactoring and functional
   decomposition.
Once we've processed all the "no requires" fetches, we can no longer
create batches from the remaining fetches. I get the same hanging-waiter
problem that I encountered earlier in prototyping.

For now, I've implemented a solution which creates one set of batches
and then forces remaining queries to execute independently. Hopefully
that will be enough to satisfy the requirements of the feature.
I don't need two separate functions for counting fetches. Just provide a
parameter.
I forgot that I had to fix up the subgraph_fetches function with my
check on `include_requires`; a sketch of the resulting shape follows.
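A rough sketch of that parameterised counter. The PlanNode shape is a deliberate simplification, not the router's real query-plan type:

```rust
// Simplified stand-in for a query-plan node.
enum PlanNode {
    Fetch { requires: bool },
    Sequence(Vec<PlanNode>),
    Parallel(Vec<PlanNode>),
}

// One counting function with a flag, instead of two near-duplicates.
fn subgraph_fetches(node: &PlanNode, include_requires: bool) -> usize {
    match node {
        // Count a fetch with a "requires" dependency only when asked to.
        PlanNode::Fetch { requires } => usize::from(include_requires || !requires),
        PlanNode::Sequence(children) | PlanNode::Parallel(children) => children
            .iter()
            .map(|child| subgraph_fetches(child, include_requires))
            .sum(),
    }
}
```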
Add some content to the existing pages for early review by docs.
@garypen self-assigned this Feb 15, 2024


@router-perf bot commented Feb 15, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

Outlines new capabilities and adds illustrative configuration.
Isolate the batching support into its own module and start to reduce
code duplication.
BatchDetails -> BatchQuery
SharedBatchDetails -> Batch

These names better reflect their nature.
Replace the 4-element tuple with a struct which captures all the
relevant data for waiting for a request.

At some point the impl for a Waiter will have functions that cover
formatting and transforming from different representations, but not
yet...
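Purely as an illustration of the tuple-to-struct change (every field name below is a guess, not the router's actual definition):

```rust
use tokio::sync::oneshot;

// Before: an anonymous 4-element tuple, addressed by position.
// After: named fields that say what each element is for.
struct Waiter {
    context: String,                  // placeholder for the router Context
    request: String,                  // placeholder for the subgraph request
    operation_name: Option<String>,   // placeholder for the GraphQL operation
    respond: oneshot::Sender<String>, // channel that wakes the parked task
}
```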
Rename some things to make it more obvious what's going on. Get ready to
add some utility functions to batching.rs
To move some of the code into a function.
Start trying to break down the megafunction that is call_http().
We can't leave waiters in a Batch, since that would mean a request had
been "lost", with all the negative implications of that for the client.

So, until I figure out the details, impl Drop for Batch and panic if
self.waiters is not empty.
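That guard, sketched minimally (the real Batch holds much more state):

```rust
// Placeholder for the per-request state a batch holds.
struct Waiter;

struct Batch {
    waiters: Vec<Waiter>,
}

impl Drop for Batch {
    fn drop(&mut self) {
        // A waiter left behind in a dropped Batch is a client request that
        // would never receive a response, so fail loudly rather than lose it.
        assert!(
            self.waiters.is_empty(),
            "batch dropped with {} waiter(s) still pending",
            self.waiters.len()
        );
    }
}
```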
We need a new error to represent things that can go wrong when we are
manipulating batches.

This new type of FetchError does the job for us. I've also cleaned up
some of the interactions with other functions, and tracing.
Extract out the http_response_into_graphql_response() function and
improve the building of arrays of batch responses.

Also:
 - add checks/invariants for array length comparison
 - indicate when code is unreachable
 - add some comments
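The length invariant reads roughly like this (an assumed example, not the PR's code):

```rust
// A batch response must contain exactly one entry per request; anything
// else means the subgraph misbehaved, so fail fast with a clear message.
fn check_batch_sizes(requests: &[String], responses: &[String]) {
    assert_eq!(
        requests.len(),
        responses.len(),
        "subgraph batch returned {} entries for {} requests",
        responses.len(),
        requests.len()
    );
}
```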
Add the optional "subgraph" attribute to capture batching metrics.
We need to add support for subgraph filtering to the subgraph service,
but we can't do that until the configuration work is completed.

Add a comment so we don't forget about it.
garypen and others added 6 commits April 12, 2024 10:15
This might be reverted in a future validation compatibility PR, but for
now update the snapshot to match the current validation behaviour.
Correct link.

Co-authored-by: Maria Elisabeth Schreiber <maria.schreiber@apollographql.com>
@o0Ignition0o (Contributor) left a comment:
Couple of nits, nothing that prevents a merge, lgtm.

apollo-router/src/services/subgraph_service.rs (outdated, resolved)
apollo-router/src/services/router/service.rs (outdated, resolved)
@garypen requested a review from Geal April 16, 2024 10:27
@garypen enabled auto-merge (squash) April 16, 2024 10:32
@garypen disabled auto-merge April 16, 2024 11:01
Development

Successfully merging this pull request may close these issues:

Allow batched requests to subgraphs

10 participants