Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rationalize ref-counting around ChannelActionListener #102551

Conversation

original-brownbear
Copy link
Member

This is a bit of a follow-up to #102498 and is needed to enable the correct handling of search responses both when going over the wire and running node-local, in the same way the other change was needed to not lose a reference.

This listener (the channel action listener) was behaving differently from other listeners in that it would effectively result in decRef by 1 when invoked. This worked out so far by accounting for this fact in calling code but is not maintainable now that more and more ref-counted things are going to be passed into it => make it behave like any other listener and be neutral in respect to ref count by acquiring the reference that the channel will release.
Marking non-issue since there's no real bug here yet in production, but the behavior.

This listener was behaving differently from other listeners in that it
would effectively result in decRef by 1 when invoked. This worked out so
far by accounting for this fact in calling code but is not maintainable
now that more and more ref-counted things are going to be passed into
it => make it behave like any other listener and be neutral in respect
to ref count by acquiring the reference that the channel will release.
@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations labels Nov 23, 2023
@elasticsearchmachine elasticsearchmachine added v8.12.0 Team:Distributed Meta label for distributed team labels Nov 23, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -378,7 +378,10 @@ public void onFailure(Exception e) {
null
),
new SearchShardTask(123L, "", "", "", null, Collections.emptyMap()),
result
result.delegateFailure((l, r) -> {
r.incRef();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of hack is fine, no need to do anything fancy, we never use this transport action with a future in prod.

@@ -391,7 +394,7 @@ public void onFailure(Exception e) {
);
PlainActionFuture<FetchSearchResult> listener = new PlainActionFuture<>();
service.executeFetchPhase(req, new SearchShardTask(123L, "", "", "", null, Collections.emptyMap()), listener);
listener.get().decRef();
listener.get();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this doesn't need to return anything that has > 0 ref count, just using the future to wait on the action here.

@@ -648,8 +650,27 @@ private IndexShard getShard(ShardSearchRequest request) {
return indicesService.indexServiceSafe(request.shardId().getIndex()).getShard(request.shardId().id());
}

private static <T> void runAsync(Executor executor, CheckedSupplier<T, Exception> executable, ActionListener<T> listener) {
executor.execute(ActionRunnable.supply(listener, executable));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the problem, we implicitly assumed that the listener would always count down the reference here. This held for all current implementations which saw a ChannelActionListener but broke once I tried to use this with the multi-search response which can either go to a ChannelActionListener or some other listeners (which would then have to be aware of the fact that they need to count-down by one which would make the code needlessly brittle.

@Override
public void accept(ActionListener<T> l) throws Exception {
var res = executable.get();
try {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, following this pattern (instantiate -> try -> finally -> decRef) as closely as possible and not implicitly requiring listeners to ever count down a ref makes the code a lot easier to follow and extend (as can be seen in the test simplifications ... mod the futures :P).

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, although why are there not more consequences? Is it just that we've never had a meaningfully ref-counted response before?

@original-brownbear
Copy link
Member Author

Is it just that we've never had a meaningfully ref-counted response before?

We haven't the search responses would be the first ones that have client and other (directly through the transport service) usage it seems.

@original-brownbear
Copy link
Member Author

Actually hold that thought ... Ccr seems to be affected looking into it, sorry for the noise.

@DaveCTurner
Copy link
Contributor

Gotcha, ok. The tests suggest there's more to do here :)

@DaveCTurner
Copy link
Contributor

Lots of good .mustIncRef() candidates here too tho, can we get #102515 in first?

@@ -28,6 +28,7 @@ public ChannelActionListener(TransportChannel channel) {

@Override
public void onResponse(Response response) {
response.incRef(); // acquire reference that will be released by channel.sendResponse below
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection I think it'd be nicer to push this down a level or two and make TransportChannel#sendResponse refcount-neutral. In particular it's kinda weird that OutboundHandler#sendRequest is already refcount-neutral whilst OutboundHandler#sendResponse is not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it sucks but could we push that out to another PR? That's practically a rather big change it seems, just tried it. We have a number of spots outside of this listener where we call the channel.sendResponse() directly and I'd have to adjust all of those. That goes beyond the scope of what I'm trying to fix here. Note that this at least gets us closer to fixing the OutBoundHandler :) ok with you?

Copy link
Contributor

@DaveCTurner DaveCTurner Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ok makes sense - I opened #102600 to make sure we don't forget

Comment on lines -415 to -418
while (origResp.hasReferences()) {
newResp.incRef();
origResp.decRef();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dnhatn could you take a look at the ESQL changes in this PR? Esp. here, I'm not sure killing all references of the message was intended to be part of the test. (It validates ownership assumption, the channel should only be able to kill its "own" reference.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop ensures that newResp has the same number of references as origResp. This change should be okay, assuming the test passes.

@original-brownbear
Copy link
Member Author

Thanks @alex-spies for the help here, @DaveCTurner could you also give this another look. Shouldn't have changed outside the ESQL part from the last time you reviewed I think, thanks!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming CI is happy: I've not validated that this is everything that needs doing, but everything here looks good.

Suggested some mustIncRef usages, but we can do them in a followup if you'd prefer.

@@ -28,6 +28,7 @@ public ChannelActionListener(TransportChannel channel) {

@Override
public void onResponse(Response response) {
response.incRef(); // acquire reference that will be released by channel.sendResponse below
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
response.incRef(); // acquire reference that will be released by channel.sendResponse below
response.mustIncRef(); // acquire reference that will be released by channel.sendResponse below

@@ -686,6 +707,7 @@ private SearchPhaseResult executeQueryPhase(ShardSearchRequest request, SearchSh
final RescoreDocIds rescoreDocIds = context.rescoreDocIds();
context.queryResult().setRescoreDocIds(rescoreDocIds);
readerContext.setRescoreDocIds(rescoreDocIds);
// inc-ref query result because we close the SearchContext that references it in this try-with-resources block
context.queryResult().incRef();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
context.queryResult().incRef();
context.queryResult().mustIncRef();

@@ -783,6 +805,7 @@ public void executeQueryPhase(QuerySearchRequest request, SearchShardTask task,
final RescoreDocIds rescoreDocIds = searchContext.rescoreDocIds();
queryResult.setRescoreDocIds(rescoreDocIds);
readerContext.setRescoreDocIds(rescoreDocIds);
// inc-ref query result because we close the SearchContext that references it in this try-with-resources block
queryResult.incRef();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
queryResult.incRef();
queryResult.mustIncRef();

@@ -866,6 +889,7 @@ public void executeFetchPhase(ShardFetchRequest request, SearchShardTask task, A
executor.success();
}
var fetchResult = searchContext.fetchResult();
// inc-ref fetch result because we close the SearchContext that references it in this try-with-resources block
fetchResult.incRef();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fetchResult.incRef();
fetchResult.mustIncRef();

@@ -378,7 +378,10 @@ public void onFailure(Exception e) {
null
),
new SearchShardTask(123L, "", "", "", null, Collections.emptyMap()),
result
result.delegateFailure((l, r) -> {
r.incRef();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.incRef();
r.mustIncRef();

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESQL part looks good to me.

Comment on lines -415 to -418
while (origResp.hasReferences()) {
newResp.incRef();
origResp.decRef();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop ensures that newResp has the same number of references as origResp. This change should be okay, assuming the test passes.

@original-brownbear
Copy link
Member Author

Thanks so much David and Nhat! Will merge as is but will do the mustIncRef in a follow-up, promise! :)

@original-brownbear original-brownbear merged commit 6f72a1c into elastic:main Nov 24, 2023
14 checks passed
@original-brownbear original-brownbear deleted the rationalize-channel-action-listener branch November 24, 2023 18:19
elasticsearchmachine pushed a commit that referenced this pull request Nov 24, 2023
…" (#102609)

This reverts commit 6f72a1c #102551 ,
sadly this is still causing ESQL test failures and we need to fix ESQL
some more before we can continue here.
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Nov 27, 2023
leemthompo pushed a commit to leemthompo/elasticsearch that referenced this pull request Nov 27, 2023
…lastic#102638)

* Revert "Revert "Rationalize ref-counting around ChannelActionListener (elastic#102551)" (elastic#102609)"

This reverts commit fdf51ac.

* Fix response ownership in ESQL
timgrein pushed a commit to timgrein/elasticsearch that referenced this pull request Nov 30, 2023
This listener was behaving differently from other listeners in that it
would effectively result in decRef by 1 when invoked. This worked out so
far by accounting for this fact in calling code but is not maintainable
now that more and more ref-counted things are going to be passed into
it => make it behave like any other listener and be neutral in respect
to ref count by acquiring the reference that the channel will release.

Co-authored-by: Alexander Spies <alexander.spies@elastic.co>
timgrein pushed a commit to timgrein/elasticsearch that referenced this pull request Nov 30, 2023
…c#102551)" (elastic#102609)

This reverts commit 6f72a1c elastic#102551 ,
sadly this is still causing ESQL test failures and we need to fix ESQL
some more before we can continue here.
timgrein pushed a commit to timgrein/elasticsearch that referenced this pull request Nov 30, 2023
…lastic#102638)

* Revert "Revert "Rationalize ref-counting around ChannelActionListener (elastic#102551)" (elastic#102609)"

This reverts commit fdf51ac.

* Fix response ownership in ESQL
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Network Http and internode communication implementations >non-issue Team:Distributed Meta label for distributed team v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants