Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SearchResponse reference count leaks in ML module #103009

Merged
merged 4 commits into from
Dec 6, 2023

Conversation

original-brownbear
Copy link
Member

Fixing all kinds of leaks in both ml prod and test code. Added a new utility for a very common operation in tests that I'm planning on replacing other use sites with in a follow up.

part of #102030

Fixing all kinds of leaks in both ml prod and test code.
Added a new utility for a very common operation in tests that
I'm planning on replacing other use sites with in a follow up.
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Search/Search Search-related issues that do not fall into other categories :ml Machine learning labels Dec 5, 2023
@elasticsearchmachine elasticsearchmachine added Team:Search Meta label for search team Team:ML Meta label for the ML team v8.13.0 labels Dec 5, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on some trickier parts of the code.

Asking @droberts195 to give a look over some particular lines that cause me some pause.

timingStatsReporter.reportSearchDuration(searchResponse.getTook());
scrollId = searchResponse.getScrollId();
SearchHit hits[] = searchResponse.getHits().getHits();
return processAndConsumeSearchHits(hits);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think processAndConsumeSearchHits(hits); actually consumes the hits and thus a reference isn't required any longer (its not async).

@droberts195 could you confirm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this point, all this change does is release the reference in this method/block because we instantiated it here. Whether or not processAndConsumeSearchHits requires additional ref counting on the hits is a question for a follow-up PR. For now the hit instances aren't ref counted and releasing the response is a noop (except for the leak-tracking) in the next incoming step that we're preparing here. We shouldn't hold on to a reference to the response if we only care about the hits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@original-brownbear OK, thank you for clarification. I misunderstood as I thought the whole point of ref counting was allowing hits to re-use buffered bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of ref counting was allowing hits to re-use buffered bytes.

Right but we want granularity at the SearchHit level here in the end, otherwise holding on to a single hit could hold on to a much much larger buffer containing other hits in the same response.
We need the ref-counting on the search response to enable ref-counting search hits at the level of these changes but each SearchHit itself will be tracked individually in the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question was about whether this code is dangerous now:

// hack to remove the reference from object. This object can be big and consume alot of memory.
// We are removing it as soon as we process it.
hits[i] = null;

The reason that's there is that as we iterate the hits we're building up a similar sized list of different objects derived from the hits. If garbage collection needs to run half way through iterating the hits we want it to be able to reclaim the half that we've already processed. Otherwise as we near the end of the hits we'll be temporarily using around double the memory of the hits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will become dangerous once the search hits are ref counted, but we won't miss that with the leak tracking infrastructure in place. I'll address this in the PR that makes the hits ref counted, nothing to worry about yet :)

}
SearchHit stateDoc = stateResponse.getHits().getAt(0);
logger.debug(() -> format("[%s] Restoring state document [%s]", config.jobId(), stateDoc.getId()));
StateToProcessWriterHelper.writeStateToStream(stateDoc.getSourceRef(), restoreStream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 could you confirm here as well? Writing to the restoreStream here doesn't require any references to the response once finished?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as with the other question, let's not worry about this here. We will make SearchHit ref-counted and can worry about this part when we do. We shouldn't keep track of the response as a proxy for an individual hit.

Comment on lines +90 to +92
writeStateToStream(stateResponse.getHits().getAt(0).getSourceRef(), restoreStream);
} finally {
stateResponse.decRef();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 sorry for another ping, just wanting to double check all these things and that decRef is good and writeStateToStream doesn't asynchronously require a reference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no chance of a problem here as writeStateToStream is synchronous. By the time it returns all data should have been copied into a named pipe.

return mapHits(searchResponse);
try {
scrollId = searchResponse.getScrollId();
return mapHits(searchResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, the mapHits always creates new objects from what I can tell. I would be scared if it kept a reference to a search hit.

cc @droberts195 ^ from what I read, this is ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is OK. The _source of the hits gets parsed by the appropriate result parser synchronously. Then a list of ML results objects gets returned. So by the time the method returns the search hits are no longer needed or referenced.

} else {
params = EMPTY_PARAMS;
}
return new SearchTransform.Result(request, new Payload.XContent(resp, params));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ok, while this isn't ML related but instead @elastic/es-core-infra

@benwtrent
Copy link
Member

Since this touches a ton of production ML code, it would be really nice to have a @elastic/ml-core team member review, preferably @droberts195 as he is most familiar with all aspects of these various usages of SearchResponse.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we got some green CI :)

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I couldn't see anything wrong with the changes in this PR. Probably the biggest risk is if there's a place that's been missed and isn't in this PR. Is there a good way to detect places that have been missed?

@original-brownbear
Copy link
Member Author

Thanks everyone!

@original-brownbear original-brownbear merged commit 48144ba into elastic:main Dec 6, 2023
15 checks passed
@original-brownbear original-brownbear deleted the fix-leaks-ml branch December 6, 2023 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning :Search/Search Search-related issues that do not fall into other categories Team:ML Meta label for the ML team Team:Search Meta label for search team >test Issues or PRs that are addressing/adding tests v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants