Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ref counting when using futures in AbstractClient #102498

Merged

Conversation

original-brownbear
Copy link
Member

Found this now that #102030 is getting closer to completion. It's fundamentally broken how the client deals with ref counted messages. We were only saved by the fact that currently we do do not handle any messages with non-noop ref counting through this client interface.

We fundamentally need to increment the ref count by one before assigning a ref counted value to a future result.
Otherwise, transport actions will have to understand the specific kind of listener they resolve and cannot decrement sent/consumed transport messages themselves.

marking non-issue since this hasn't caused any production trouble yet.

We fundamentally need to increment the ref count by one before
assigning a ref counted value to a future result.
Otherwise, trasnsport actions will have to understand the specific
kind of listener they resolve and cannot decrement sent/consumed
transport messages themselves.
@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations labels Nov 22, 2023
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team v8.12.0 labels Nov 22, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -37,7 +37,6 @@ protected <Request extends ActionRequest, Response extends ActionResponse> void
ActionListener<Response> listener
) {
assertEquals(origin, threadPool().getThreadContext().getTransient(ThreadContext.ACTION_ORIGIN_TRANSIENT_NAME));
super.doExecute(action, request, listener);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither here nor in the other test do we care about resolving the listener. The result passed by the noop client is always null. I figured this was still better than adding a null check or hackily casting a non-null response in the noop-client ...

* on the result before it goes out of scope.
* @param <R> reference counted result type
*/
public static class ForRefCounted<R extends RefCounted> extends PlainActionFuture<R> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will probably need something similar for listenable future as well at some point. I didn't add any tests here yet before we're cool with the approach, I can though obviously it's trivial :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I see the problem but this is a fairly awful hack IMO. That said I don't see great alternatives without somewhat invasive changes. The whole concept of blocking calls to the client is kinda bad, but does have production users (mostly Watcher and Monitoring it seems).

For a general-purpose utility I think the whole future should implement Releasable so that there's a proper incRef/decRef pair to keep things clear. As written here, there's an implicit assumption that there's only one call to .get() the result: if there are multiple .get() callers then they would need to coordinate the single required decRef() call amongst themselves. That seems super-trappy to me.

Also for a general-purpose utility we'd need to handle the case where the future is completed multiple times, in which we only incRef the result if it's the first completion.

Maybe it'd be better to make this specific behaviour local to the client rather than a general-purpose utility. I expect we've more chance of being able to assert that there's only a single call to .get() the result in that case.

Maybe also it should be a different API on the client for those few actions where refs matter so that it's clear to the caller they are taking a ref when calling .get(). Not 100% sure about that tho.

@original-brownbear
Copy link
Member Author

@DaveCTurner I admit it, it's hacky :) How about this, I moved the custom future into the client to limit the blast radius and I promise I'll look into removing the production usage of blocking get?
I'd mostly like to get this through quickly and without a lot of effort because:

  1. the blocking usage in prod must go away anyway
  2. this is pretty urgent because it blocks the search ref counting work :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I can live with this for now to get the search response work through. I left a suggestion for a slightly safer implementation. Could you also add something to assert that .get() is only called once? And we need to do something with the timed .get() because a timeout would be a leak, although I'm not sure what exactly would be best there.

@DaveCTurner
Copy link
Contributor

Suggestion for a more general-purpose solution: #102507

@original-brownbear
Copy link
Member Author

@DaveCTurner I'll have a look later :) But for now, wdyt about this solution? I made it safer with your suggestion and added an assertion about this only getting called once. Good enough to unblock this one? :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm misreading something but it looks like you've got both impls now?

image

@original-brownbear
Copy link
Member Author

original-brownbear commented Nov 23, 2023

I'm so sorry ... 🤦 cd4fdd5
guess the last 3 lines just became a noop and it still worked out randomly ...

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner
Copy link
Contributor

#102515 would have saved some of the mockery muckery I think

@original-brownbear
Copy link
Member Author

Thanks David!

@original-brownbear original-brownbear added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Nov 23, 2023
@elasticsearchmachine elasticsearchmachine merged commit 659b236 into elastic:main Nov 23, 2023
13 checks passed
@original-brownbear original-brownbear deleted the fix-client-ref-counting branch November 23, 2023 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Network Http and internode communication implementations >non-issue Team:Distributed Meta label for distributed team v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants