Fix SearchResponse reference count leaks in ML module #103009

original-brownbear · 2023-12-05T18:36:37Z

Fixes all kinds of SearchResponse reference count leaks in both ML production and test code. Also adds a new test utility for a very common operation; I'm planning to switch other use sites over to it in a follow-up.

part of #102030

elasticsearchmachine · 2023-12-05T18:37:01Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2023-12-05T18:37:02Z

Pinging @elastic/es-search (Team:Search)

benwtrent

Some comments on some trickier parts of the code.

Asking @droberts195 to take a look at a few particular lines that give me pause.

...elasticsearch/xpack/ml/datafeed/extractor/aggregation/CompositeAggregationDataExtractor.java

benwtrent · 2023-12-06T16:08:02Z

.../src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/scroll/ScrollDataExtractor.java

+            timingStatsReporter.reportSearchDuration(searchResponse.getTook());
+            scrollId = searchResponse.getScrollId();
+            SearchHit hits[] = searchResponse.getHits().getHits();
+            return processAndConsumeSearchHits(hits);


I think processAndConsumeSearchHits(hits) actually consumes the hits, so a reference is no longer required afterwards (it's not async).

@droberts195 could you confirm?

Not sure I understand this point. All this change does is release the reference in this method/block, because we instantiated it here. Whether or not processAndConsumeSearchHits requires additional ref counting on the hits is a question for a follow-up PR. For now the hit instances aren't ref counted, and releasing the response is a no-op (except for the leak tracking) in the next incoming step that we're preparing here. We shouldn't hold on to a reference to the response if we only care about the hits.
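The ownership rule described above (the block that obtains the response is the one that releases it) can be sketched as follows. This is a minimal illustration, not the Elasticsearch code: `SketchResponse`, `ScrollStepSketch`, and `nextBatch` are hypothetical stand-ins, and the hits are plain strings that are not ref counted, mirroring the state of the code at the time of this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a ref-counted search response (not the Elasticsearch class).
final class SketchResponse {
    private final AtomicInteger refs = new AtomicInteger(1); // the creator holds the first reference
    private final List<String> hits;

    SketchResponse(List<String> hits) {
        this.hits = hits;
    }

    List<String> hits() {
        if (refs.get() <= 0) {
            throw new IllegalStateException("response already released");
        }
        return hits;
    }

    void decRef() {
        if (refs.decrementAndGet() < 0) {
            throw new IllegalStateException("released more often than acquired");
        }
    }

    int refCount() {
        return refs.get();
    }
}

final class ScrollStepSketch {
    // The response is obtained inside this method, so this method releases it,
    // on the normal path and on the exceptional path alike.
    static List<String> nextBatch(Supplier<SketchResponse> search) {
        SketchResponse response = search.get();
        try {
            // Only the hits are needed by the caller, so only the hits leave this block.
            return new ArrayList<>(response.hits());
        } finally {
            response.decRef();
        }
    }
}
```

With this shape the caller never sees the response at all, so it cannot forget to release it.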

@original-brownbear OK, thank you for the clarification. I misunderstood, as I thought the whole point of ref counting was allowing hits to re-use buffered bytes.

> of ref counting was allowing hits to re-use buffered bytes.

Right, but in the end we want granularity at the SearchHit level here; otherwise holding on to a single hit could hold on to a much, much larger buffer containing other hits in the same response.
We need the ref counting on the search response to enable ref counting of search hits at the level of these changes, but each SearchHit itself will be tracked individually in the end.

I guess the question was about whether this code is dangerous now:

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/scroll/ScrollDataExtractor.java

Lines 206 to 208 in d68670f

// hack to remove the reference from object. This object can be big and consume alot of memory.
// We are removing it as soon as we process it.
hits[i] = null;

The reason that's there is that, as we iterate the hits, we're building up a similar-sized list of different objects derived from the hits. If garbage collection needs to run halfway through iterating the hits, we want it to be able to reclaim the half that we've already processed. Otherwise, as we near the end of the hits, we'll temporarily be using around double the memory of the hits.
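The memory argument above can be sketched in isolation. This is a simplified illustration under assumptions: `ConsumeHitsSketch` and `processAndConsume` are hypothetical names, strings stand in for search hits, and the string length stands in for the derived object.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsumeHitsSketch {
    // Builds one derived object per input and clears the array slot right away,
    // so a GC cycle that runs mid-loop can reclaim the inputs already processed.
    static List<Integer> processAndConsume(String[] hits) {
        List<Integer> derived = new ArrayList<>(hits.length);
        for (int i = 0; i < hits.length; i++) {
            derived.add(hits[i].length()); // stand-in for the object derived from the hit
            hits[i] = null;                // drop the array's reference to the processed hit
        }
        return derived;
    }
}
```

Clearing the slot only helps when the array holds the last strong reference to each element; if anything else still points at the hits, they stay reachable regardless.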

It will become dangerous once the search hits are ref counted, but we won't miss that with the leak tracking infrastructure in place. I'll address this in the PR that makes the hits ref counted, nothing to worry about yet :)
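To illustrate the general idea behind leak tracking in tests, here is a minimal sketch. It is not the Elasticsearch leak tracking infrastructure, whose details are not shown in this thread; `LeakTrackerSketch` and its methods are hypothetical names chosen for the example.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical test-only tracker: remembers every tracked object until it is released.
final class LeakTrackerSketch {
    // Identity semantics, so two equal-looking resources are still tracked separately.
    private final Set<Object> live = Collections.newSetFromMap(new IdentityHashMap<>());

    // Called when a ref-counted object is created.
    synchronized <T> T track(T resource) {
        live.add(resource);
        return resource;
    }

    // Called when the last reference to the object is released.
    synchronized void released(Object resource) {
        live.remove(resource);
    }

    // A test teardown would fail if this is not zero.
    synchronized int leakCount() {
        return live.size();
    }
}
```

A check like this turns a forgotten release into a test failure instead of a silent leak, which is why unsafe uses of ref-counted hits would be hard to miss.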

benwtrent · 2023-12-06T16:10:47Z

...in/ml/src/main/java/org/elasticsearch/xpack/ml/dataframe/process/NativeAnalyticsProcess.java

+                    }
+                    SearchHit stateDoc = stateResponse.getHits().getAt(0);
+                    logger.debug(() -> format("[%s] Restoring state document [%s]", config.jobId(), stateDoc.getId()));
+                    StateToProcessWriterHelper.writeStateToStream(stateDoc.getSourceRef(), restoreStream);


@droberts195 could you confirm here as well? Writing to the restoreStream here doesn't require any references to the response once finished?

Same issue as with the other question, let's not worry about this here. We will make SearchHit ref-counted and can worry about this part when we do. We shouldn't keep track of the response as a proxy for an individual hit.

.../main/java/org/elasticsearch/xpack/ml/inference/persistence/ChunkedTrainedModelRestorer.java

.../plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/persistence/JobResultsPersister.java

benwtrent · 2023-12-06T16:18:33Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/persistence/StateStreamer.java

+                    writeStateToStream(stateResponse.getHits().getAt(0).getSourceRef(), restoreStream);
+                } finally {
+                    stateResponse.decRef();


@droberts195 sorry for another ping. I just want to double check that the decRef here is safe and that writeStateToStream doesn't asynchronously require a reference.

There's no chance of a problem here as writeStateToStream is synchronous. By the time it returns all data should have been copied into a named pipe.
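Why a synchronous write makes the `decRef` in `finally` safe can be sketched like this. The example rests on assumptions: `PooledBytesSketch` and `RestoreStateSketch` are hypothetical stand-ins, and zeroing the array on release merely simulates a pool recycling the memory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical pooled buffer: once released, its bytes may be overwritten.
final class PooledBytesSketch {
    private final byte[] bytes;
    private int refs = 1;

    PooledBytesSketch(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    // Synchronous: returns only after all bytes were handed to the stream.
    void writeTo(OutputStream out) {
        if (refs <= 0) {
            throw new IllegalStateException("buffer already released");
        }
        try {
            out.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void decRef() {
        if (--refs == 0) {
            Arrays.fill(bytes, (byte) 0); // simulate the pool recycling the memory
        }
    }
}

final class RestoreStateSketch {
    static void restore(PooledBytesSketch state, OutputStream restoreStream) {
        try {
            state.writeTo(restoreStream); // the copy completes before this call returns
        } finally {
            state.decRef();               // so releasing here cannot corrupt the copy
        }
    }
}
```

If the write were asynchronous instead, the release would have to move into the completion callback, because the buffer could otherwise be recycled while it is still being read.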

benwtrent · 2023-12-06T16:21:15Z

.../ml/src/main/java/org/elasticsearch/xpack/ml/utils/persistence/BatchedDocumentsIterator.java

-        return mapHits(searchResponse);
+        try {
+            scrollId = searchResponse.getScrollId();
+            return mapHits(searchResponse);


This is good; mapHits always creates new objects, from what I can tell. I would be scared if it kept a reference to a search hit.

cc @droberts195 ^ from what I read, this is ok.

Yes, I think this is OK. The _source of the hits gets parsed by the appropriate result parser synchronously. Then a list of ML results objects gets returned. So by the time the method returns the search hits are no longer needed or referenced.

benwtrent · 2023-12-06T16:24:09Z

...rc/main/java/org/elasticsearch/xpack/watcher/transform/search/ExecutableSearchTransform.java

+                } else {
+                    params = EMPTY_PARAMS;
+                }
+                return new SearchTransform.Result(request, new Payload.XContent(resp, params));


This seems OK, although this isn't ML-related; it belongs to @elastic/es-core-infra instead.

benwtrent · 2023-12-06T17:52:07Z

Since this touches a ton of production ML code, it would be really nice to have a @elastic/ml-core team member review, preferably @droberts195 as he is most familiar with all aspects of these various usages of SearchResponse.

benwtrent

Once we get some green CI :)

droberts195

LGTM

I couldn't see anything wrong with the changes in this PR. Probably the biggest risk is if there's a place that's been missed and isn't in this PR. Is there a good way to detect places that have been missed?

original-brownbear · 2023-12-06T19:11:41Z

Thanks everyone!

Fix SearchResponse reference count leaks in ML module

90c36bf

original-brownbear added the >test, :Search/Search, and :ml labels Dec 5, 2023

elasticsearchmachine added the Team:Search, Team:ML, and v8.13.0 labels Dec 5, 2023

original-brownbear mentioned this pull request Dec 5, 2023

[Meta] Make SearchResponse use pooled buffers for search hits #102030

Closed

78 tasks

mark-vieira added v8.12.0 and removed v8.12.0 v8.13.0 labels Dec 5, 2023

original-brownbear requested review from benwtrent and DaveCTurner December 6, 2023 13:07

benwtrent reviewed Dec 6, 2023

original-brownbear added 2 commits December 6, 2023 17:48

Merge remote-tracking branch 'elastic/main' into fix-leaks-ml

f8ebc47

release asap

1324412

original-brownbear requested a review from benwtrent December 6, 2023 16:57

Merge remote-tracking branch 'elastic/main' into fix-leaks-ml

32728e2

benwtrent approved these changes Dec 6, 2023

droberts195 approved these changes Dec 6, 2023

original-brownbear merged commit 48144ba into elastic:main Dec 6, 2023
15 checks passed

original-brownbear deleted the fix-leaks-ml branch December 6, 2023 19:11

droberts195 added v8.13.0 and removed v8.12.0 labels Dec 7, 2023

droberts195 mentioned this pull request Dec 7, 2023

[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign failing #103108

Closed
