
[CI] IndicesQueryCache.close assertion failure in internal cluster tests #37117

Closed
davidkyle opened this issue Jan 3, 2019 · 26 comments · Fixed by #38958 or #39111

@davidkyle
Member

The timing suggests these failures are related to the change moving the ML config out of the cluster state.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/152/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/149/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+periodic/57/console

The log contains an assertion failure

java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@b62bec8=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@30fd04c3}
	at __randomizedtesting.SeedInfo.seed([374793047E6CA339:C9896ED516320174]:0)
	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
	at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:296)
	at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:103)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
	at org.elasticsearch.node.NodeService.close(NodeService.java:135)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
	at org.elasticsearch.node.Node.close(Node.java:847)
	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:990)
	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
	at org.elasticsearch.test.InternalTestCluster.close(InternalTestCluster.java:818)
	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:586)
	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2225)

Then the suite fails with a thread leak error.
thread-leak.txt

Does not reproduce:

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=374793047E6CA339 \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testMachineLearningStopDatafeedActionNotRestricted" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-SD \
  -Dtests.timezone=BET \
  -Dcompiler.java=11 \
  -Druntime.java=8

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=B805872D5C817758 \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testAutoCloseJobWithDatafeed" \
  -Dtests.security.manager=true \
  -Dtests.locale=zh-Hans-HK \
  -Dtests.timezone=Etc/GMT-5 \
  -Dcompiler.java=11 \
  -Druntime.java=11

./gradlew :x-pack:plugin:ml:unitTest \
  -Dtests.seed=CB48E860FD3AC07E \
  -Dtests.class=org.elasticsearch.license.MachineLearningLicensingTests \
  -Dtests.method="testAutoCloseJobWithDatafeed" \
  -Dtests.security.manager=true \
  -Dtests.locale=de-GR \
  -Dtests.timezone=America/Rankin_Inlet \
  -Dcompiler.java=11 \
  -Druntime.java=8
@davidkyle davidkyle added >test-failure Triaged test failures from CI :ml Machine learning labels Jan 3, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195
Contributor

The assertion that's failing is assert stats2.isEmpty() : stats2; in IndicesQueryCache.close.

Can anyone from @elastic/es-search offer any advice on what might cause this? Is it because ML is doing some operation on the index while it's being closed?

@droberts195
Contributor

One more observation: many failures of this type seem to occur when running on the Zulu JVM.

However, I found one that isn't: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/165/consoleText

So it may be that there's something about the Zulu JVM that makes these failures more likely to occur, but it's not the root cause of the problem.

@alpar-t
Contributor

alpar-t commented Jan 9, 2019

@davidkyle
Member Author

davidkyle commented Jan 10, 2019

There was a failure in BasicDistributedJobsIT that looks like it has the same root cause

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/172/console

java.lang.AssertionError: 2
	at __randomizedtesting.SeedInfo.seed([3EDB4FBFCECABCD4:D16E134409CDA259]:0)
	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:189)
	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
	at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:297)
	at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:105)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
	at org.elasticsearch.node.NodeService.close(NodeService.java:135)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
	at org.elasticsearch.node.Node.close(Node.java:946)
	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1008)
	at org.elasticsearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1651)
	at org.elasticsearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1637)
	at org.elasticsearch.test.InternalTestCluster.stopRandomDataNode(InternalTestCluster.java:1531)
	at org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testFailOverBasics_withDataFeeder(BasicDistributedJobsIT.java:121)

In this test a node is shut down, tripping the assertion. That doesn't kill the node but leaves it half running. Later there is this error message from the shut-down node:

  1> [2019-01-10T22:27:02,402][WARN ][o.e.t.TcpTransport       ] [node_t1] send message failed [channel: MockChannel{profile='default', isOpen=false, localAddress=/127.0.0.1:39548, isServerSocket=false}]
  1> java.net.SocketException: Socket is closed
  1> 	at java.net.Socket.getOutputStream(Socket.java:943) ~[?:1.8.0_192]
  1> 	at org.elasticsearch.transport.MockTcpTransport$MockChannel.sendMessage(MockTcpTransport.java:437) [framework-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:758) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:844) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:815) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onResponse(HandledTransportAction.java:103) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.HandledTransportAction$ChannelActionListener.onResponse(HandledTransportAction.java:87) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:360) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:356) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1116) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]

The working theory is that some request hits the node during shutdown. I've added some debug logging around the likely candidates in 92b519b

@davidkyle
Member Author

@jpountz could you take a look at the assertion failures in IndicesQueryCache.close thrown up by this test, please? Any insights you have into what the cause may be are much appreciated.

java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@b62bec8=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@30fd04c3}
	at __randomizedtesting.SeedInfo.seed([374793047E6CA339:C9896ED516320174]:0)
	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)

and

java.lang.AssertionError: 2
	at __randomizedtesting.SeedInfo.seed([3EDB4FBFCECABCD4:D16E134409CDA259]:0)
	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:189)

@cbuescher
Member

I talked to @droberts195, and he thinks these failures from today look related. It's a different test though:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/223/console

 ./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=2D587A9A8B6E70D8 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT \
  -Dtests.method="testFullClusterRestart" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-PE \
  -Dtests.timezone=America/St_Thomas \
  -Dcompiler.java=11 \
  -Druntime.java=8

The logs contain similar failures:

13:53:28 ERROR   12.1s J1 | MlDistributedFailureIT.testFullClusterRestart <<< FAILURES!
13:53:28    > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@714f684a=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@3d9dc3df, org.apache.lucene.index.IndexReader$CacheKey@1875360d=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@46a79038}
13:53:28    > 	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:130)
13:53:28    > 	at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:284)
13:53:28    > 	at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
13:53:28    > 	at org.elasticsearch.node.NodeService.close(NodeService.java:135)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
13:53:28    > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
13:53:28    > 	at org.elasticsearch.node.Node.close(Node.java:857)
13:53:28    > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1039)
13:53:28    > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:961)
13:53:28    > 	at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1860)
13:53:28    > 	at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1745)
13:53:28    > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$5(MlDistributedFailureIT.java:113)
13:53:28    > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:293)

@droberts195 droberts195 changed the title [CI][ML] MachineLearningLicensingTests failures [CI] IndicesQueryCache.close assertion failure in internal cluster tests Feb 4, 2019
@droberts195 droberts195 added the :Search/Search Search-related issues that do not fall into other categories label Feb 4, 2019
@droberts195
Contributor

I changed the title of the issue because the things that all these failures have in common are:

  1. assert stats2.isEmpty() : stats2; fails in IndicesQueryCache.close
  2. They're all using the internal cluster test framework

The stats2 variable and the assertion were added in 38f5cc2. The code then underwent a major refactor in 44c653f. The locking was changed but the assertion in close stayed the same. Could it be a problem with visibility of changes between threads that only manifests itself in the internal cluster tests?
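
For anyone not familiar with this code, the shape of the suspected problem is roughly the following. This is a simplified sketch based only on the names used in this thread (stats2, StatsAndCount, the close() assertion); it is not the actual Elasticsearch source:

import java.util.IdentityHashMap;
import java.util.Map;

class QueryCacheStatsSketch {

    static final class StatsAndCount {
        int count; // not volatile: updates made under the lock may not be visible to other threads
    }

    private final Object lock = new Object();
    private final Map<Object, StatsAndCount> stats2 = new IdentityHashMap<>();

    // Called (under the cache's lock) when a doc-id set is cached for a reader.
    void onDocIdSetCache(Object readerCacheKey) {
        synchronized (lock) {
            stats2.computeIfAbsent(readerCacheKey, k -> new StatsAndCount()).count++;
        }
    }

    // Called (under the cache's lock) when cached entries for a reader are evicted.
    void onDocIdSetEviction(Object readerCacheKey, int numEntries) {
        synchronized (lock) {
            StatsAndCount statsAndCount = stats2.get(readerCacheKey);
            statsAndCount.count -= numEntries;
            if (statsAndCount.count <= 0) {
                stats2.remove(readerCacheKey);
            }
        }
    }

    // Called from the node-shutdown thread without taking the lock. Even if every
    // entry was logically removed, this thread can observe a stale, non-empty view
    // of the map, which is exactly the assertion failure reported above.
    void close() {
        assert stats2.isEmpty() : stats2;
    }
}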

@droberts195
Contributor

@dakrone
Member

dakrone commented Feb 8, 2019

I spent all day trying to reproduce this, but I was unable to.

The only thing I could think of that would cause it was related to what @droberts195 says about thread visibility, specifically maybe making the int count field in StatsAndCount volatile, since that is the determinant of whether stats are removed from the stats2 map; but that is a shot in the dark and I don't have a way to test it since I haven't been able to reproduce this.

I'm curious if anyone else has any ideas.

@droberts195
Contributor

droberts195 commented Feb 11, 2019

Thanks for looking at this @dakrone. I agree that int count in StatsAndCount not being volatile is a likely cause of the problem.

I just had a look at the base class, LRUQueryCache in Lucene, and it has the following comment:

  // these variables are volatile so that we do not need to sync reads
  // but increments need to be performed under the lock

So I guess the base class either had the same problem at one point or else its original author realised the potential problem with visibility of changes.

Out of all the thousands of CI builds that run we only seem to get a failure of this assertion about twice a week. None of the 3 tests that have failed because of it have been muted, so it seems like the best way forward is just to make that int count volatile and see if the failures go away.

One more thing though. Can somebody who is more familiar with this confirm that ElasticsearchLRUQueryCache.onDocIdSetCache and ElasticsearchLRUQueryCache.onDocIdSetEviction can never get called by two different threads simultaneously? I guess that's covered by the // but increments need to be performed under the lock in the base class, but if not then there's a worse synchronization problem.

@romseygeek
Contributor

Can somebody who is more familiar with this confirm that ElasticsearchLRUQueryCache.onDocIdSetCache and ElasticsearchLRUQueryCache.onDocIdSetEviction can never get called by two different threads simultaneously?

The super methods of both assert that the current thread holds the cache's lock, so I think the behaviour here is fine.

+1 to making StatsAndCount.count volatile
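
Put together, the proposal amounts to something like this. Again a sketch using only the names from this thread, not the real code; Lucene's actual lock-ownership checks may be implemented differently:

class ElasticsearchLRUQueryCacheSketch {

    static final class StatsAndCount {
        volatile int count; // proposed fix: make updates to the count visible across threads
    }

    private final Object lock = new Object();

    void onDocIdSetCache(Object readerCacheKey) {
        // The Lucene base class invokes these callbacks while holding its lock, so they
        // are never called by two threads at once; an assertion can make that explicit.
        assert Thread.holdsLock(lock);
        // ... increment the StatsAndCount for readerCacheKey ...
    }

    void onDocIdSetEviction(Object readerCacheKey, int numEntries) {
        assert Thread.holdsLock(lock);
        // ... decrement the count and drop the stats2 entry when it reaches zero ...
    }
}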

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 11, 2019
The assertion that the stats2 map is empty in
IndicesQueryCache.close has been observed to
fail very occasionally in internal cluster tests.

The likely cause is a cross-thread visibility
problem for a count variable.  This change
makes that count volatile.

Relates elastic#37117
@droberts195
Contributor

+1 to making StatsAndCount.count volatile

I opened #38714 to do this

@dakrone
Member

dakrone commented Feb 11, 2019

@droberts195 thanks for opening the PR. I'm going to close this and assume your fix worked (so we don't forget to resolve this). We can always re-open it if we see more of this failure.

@jpountz
Contributor

jpountz commented Feb 19, 2019

No worries at all @droberts195. I feel like we have an actual bug that the current behavior of MlLifeCycleService is only more likely to expose. +1 to stop firing new requests in stop() though.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 19, 2019
The ML memory tracker does searches against ML results
and config indices.  These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.

This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete.  Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.

The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping the node.

Fixes elastic#37117
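
As a rough illustration of the behaviour described in that commit message (refuse new searches once stop() has been called, and wait for in-flight ones to finish), a synchronous sketch with hypothetical names could look like this; the real MlMemoryTracker runs its searches asynchronously, so its implementation differs:

import java.util.concurrent.locks.ReentrantReadWriteLock;

class MemoryTrackerStopSketch {

    private final ReentrantReadWriteLock stopLock = new ReentrantReadWriteLock();
    private volatile boolean stopped = false;

    // Returns false if the tracker is stopped and the search was not started.
    boolean runSearch(Runnable search) {
        if (!stopLock.readLock().tryLock()) {
            return false;                    // stop() already holds the write lock
        }
        try {
            if (stopped) {
                return false;                // stop() has begun; refuse new searches
            }
            search.run();                    // the real tracker issues async searches here
            return true;
        } finally {
            stopLock.readLock().unlock();
        }
    }

    // Blocks until all in-flight searches have completed; afterwards no new
    // searches will be started.
    void stop() {
        stopped = true;
        stopLock.writeLock().lock();
        stopLock.writeLock().unlock();
    }
}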
@droberts195
Contributor

This happened again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/1179/consoleText

This included my change to stop the ML memory tracker as part of Node.stop before components get closed.

The exception has moved on from previous occurrences - now it says no master node:

  1> [2019-02-24T15:07:04,458][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] after test
ERROR   11.4s J1 | MlDistributedFailureIT.testFullClusterRestart <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: {org.apache.lucene.index.IndexReader$CacheKey@533b697c=org.elasticsearch.indices.IndicesQueryCache$StatsAndCount@31412048}
   > 	at org.elasticsearch.indices.IndicesQueryCache.close(IndicesQueryCache.java:191)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
   > 	at org.elasticsearch.indices.IndicesService$2.closeInternal(IndicesService.java:265)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:63)
   > 	at org.elasticsearch.indices.IndicesService.doClose(IndicesService.java:309)
   > 	at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61)
   > 	at org.elasticsearch.node.NodeService.close(NodeService.java:135)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
   > 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
   > 	at org.elasticsearch.node.Node.close(Node.java:857)
   > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1051)
   > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:972)
   > 	at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1892)
   > 	at org.elasticsearch.test.InternalTestCluster.fullRestart(InternalTestCluster.java:1765)
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$5(MlDistributedFailureIT.java:113)
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:286)
   > 	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart(MlDistributedFailureIT.java:111)
   > 	at java.lang.Thread.run(Thread.java:748)Throwable #2: java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:294)
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:207)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.ExecutionException: MasterNotDiscoveredException[no known master node]
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:282)
   > 	... 36 more
   > Caused by: MasterNotDiscoveredException[no known master node]
   > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.doExecute(TransportStopDatafeedAction.java:108)
   > 	at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.doExecute(TransportStopDatafeedAction.java:43)
   > 	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:145)
   > 	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:121)
   > 	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:64)
   > 	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83)
   > 	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72)
   > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
   > 	at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:65)
   > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
   > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382)
   > 	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:281)
   > 	... 36 moreThrowable #3: java.lang.RuntimeException: already closed
   > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.getOrBuildNodeClient(InternalTestCluster.java:925)
   > 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.client(InternalTestCluster.java:904)
   > 	at org.elasticsearch.test.InternalTestCluster$2.next(InternalTestCluster.java:2338)
   > 	at org.elasticsearch.test.InternalTestCluster$2.next(InternalTestCluster.java:2329)
   > 	at org.elasticsearch.test.ESIntegTestCase.ensureClusterStateConsistency(ESIntegTestCase.java:1134)
   > 	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:572)
   > 	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2191)
   > 	at java.lang.Thread.run(Thread.java:748)

The log confirms that the ML memory tracker is stopped before this happens:

  1> [2019-02-24T15:07:17,363][DEBUG][o.e.x.m.p.MlMemoryTracker] [testJobRelocationIsMemoryAware] ML memory tracker stopped

@droberts195 droberts195 reopened this Feb 25, 2019
@markharwood
Contributor

markharwood commented Feb 25, 2019

build-stats link

gwbrown added a commit that referenced this issue Mar 1, 2019
This test is already muted in master (see #37117), backporting 
this mute to 7.0 as it's failing there as well.
jpountz added a commit to jpountz/elasticsearch that referenced this issue Mar 4, 2019
This commit adds the following:
 - more tests to IndicesServiceCloseTests, one of them found a bug in the order
   in which `IndicesQueryCache#onClose` and
   `IndicesService.indicesRefCount#decRef` are called.
 - made `IndicesQueryCache.stats2` a synchronized map. All writes to it are
   already protected by the lock of the Lucene cache, but the final read from
   an assertion in `IndicesQueryCache#close()` was not so this change should
   avoid any potential visibility issues.
 - human-readable `toString`s to make debugging easier.

Relates elastic#37117
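
For illustration, the synchronized-map part of that change boils down to wrapping the map so that the lock-free read in close() gets a proper happens-before relationship with the writes. A sketch, not the actual diff:

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

class Stats2WrapperSketch {

    // Writes still happen under the Lucene cache's lock; the synchronized wrapper
    // additionally makes the unsynchronized read in close() safe.
    private final Map<Object, Object> stats2 =
            Collections.synchronizedMap(new IdentityHashMap<>());

    void close() {
        assert stats2.isEmpty() : stats2;
    }
}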
@droberts195
Contributor

Since the fix of #39099 went in 21 days ago, the build stats link only shows one failure on 7.x and above. (It has failed a few more times on 6.6 and 6.7, but they don't have that fix.)

That one failure was in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8fips,nodes=immutable&&linux&&docker/82/, which is on a FIPS JVM, and a number of other tests related to splitting indices also failed in that build, so there may be some interaction there.

So it looks like the combination of #39111 and #39099 has fixed the problems that led to this issue being raised. #39111 is in 6.6.2 and above; #39099 is in 7.1.0 and above.

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019