Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] S3BlobStoreRepositoryTests testSnapshotAndRestore failure #52125

Closed
tvernum opened this issue Feb 10, 2020 · 4 comments
Closed

[CI] S3BlobStoreRepositoryTests testSnapshotAndRestore failure #52125

tvernum opened this issue Feb 10, 2020 · 4 comments
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@tvernum
Copy link
Contributor

tvernum commented Feb 10, 2020

org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests > testSnapshotAndRestore FAILED	
    java.lang.AssertionError: timed out waiting for green state	
        at org.junit.Assert.fail(Assert.java:88)	
        at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:883)	
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:836)	
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:825)	
        at org.elasticsearch.repositories.blobstore.ESBlobStoreRepositoryIntegTestCase.testSnapshotAndRestore(ESBlobStoreRepositoryIntegTestCase.java:324)	
 1> [2020-02-10T09:19:50,112][INFO ][o.e.t.s.M.Listener       ] [node_sd2] [vwwokq][3] end check index	
  1> [2020-02-10T09:19:55,880][INFO ][o.e.c.r.a.AllocationService] [node_sm0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[p][0]]]).	
  1> [2020-02-10T09:19:56,153][INFO ][o.e.r.s.S3BlobStoreRepositoryTests] [testSnapshotAndRestore] --> restore all indices from the snapshot	
  1> [2020-02-10T09:19:57,760][INFO ][o.e.t.s.M.Listener       ] [node_sd1] [vwwokq][6] start check index	
  1> [2020-02-10T09:19:57,772][INFO ][o.e.t.s.M.Listener       ] [node_sd2] [vwwokq][7] start check index	
  1> [2020-02-10T09:19:57,889][INFO ][o.e.t.s.M.Listener       ] [node_sd1] [vwwokq][6] end check index
...
1> [2020-02-10T09:19:59,242][INFO ][o.e.t.s.M.Listener       ] [node_sd1] [p][4] start check index	
  1> [2020-02-10T09:19:59,299][INFO ][o.e.t.s.M.Listener       ] [node_sd1] [p][4] end check index	
  1> [2020-02-10T09:20:19,251][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [p][0] marking unavailable shards as stale: [0AEdY9NsS0GtNU-WDiRqlw]	
  1> [2020-02-10T09:20:27,549][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [p][7] marking unavailable shards as stale: [lB1-y8RBQ_-_95ZE8rKUmA]	
  1> [2020-02-10T09:20:41,014][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [p][4] marking unavailable shards as stale: [IneAgQOJTVGmQYVAAZyAKA]	
  1> [2020-02-10T09:20:41,527][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [p][5] marking unavailable shards as stale: [080jm4sER3CKFMTOWjOb6A]	
  1> [2020-02-10T09:20:44,484][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][0] marking unavailable shards as stale: [x1tCQJqsSuONqS3OEfrIqw]	
  1> [2020-02-10T09:20:47,387][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][2] marking unavailable shards as stale: [Q7wchP7OSYa04Iom8JomXA]	
  1> [2020-02-10T09:20:48,989][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][3] marking unavailable shards as stale: [_rkzSMjHQDiiGNxbEAyfAw]	
  1> [2020-02-10T09:20:48,990][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][1] marking unavailable shards as stale: [6h1uNCewTuyiczQHYc6V_Q]	
  1> [2020-02-10T09:20:49,939][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][4] marking unavailable shards as stale: [ZgDfuuTGT4W9eRLfZdGRFg]	
  1> [2020-02-10T09:20:50,198][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][5] marking unavailable shards as stale: [-mA8LzsrRUGJR1Wq3Ukq9A]	
  1> [2020-02-10T09:20:53,410][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][6] marking unavailable shards as stale: [3NI3i0igT6WxoeFHfBUrJA]	
  1> [2020-02-10T09:20:57,016][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][8] marking unavailable shards as stale: [wGRPrm8YRjaRGqrp7t2nYQ]	
  1> [2020-02-10T09:20:57,822][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [vwwokq][7] marking unavailable shards as stale: [fkdCveOORTiYJLSn-8ZuOQ]	
  1> [2020-02-10T09:21:16,496][WARN ][o.e.c.r.a.AllocationService] [node_sm0] [p][1] marking unavailable shards as stale: [cqM4WD1-T5mXoTTji0d62g]	
  1> [2020-02-10T09:21:18,494][INFO ][o.e.r.s.S3BlobStoreRepositoryTests] [testSnapshotAndRestore] ensureGreen timed out, cluster state:	
  1> cluster uuid: 4Nzhw1AhSRSX2eAPpUI2kA [committed: true]
@tvernum tvernum added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 10, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear self-assigned this Feb 10, 2020
@original-brownbear
Copy link
Member

This looks like a randomly super slow CI run. We have some logging about slow IO operations on the IO loop like e.g.:

  1> [2020-02-10T09:18:23,381][WARN ][o.e.t.n.MockNioTransport ] [node_sd1] Slow execution on network thread [102 milliseconds]
  1> java.lang.RuntimeException: Slow exception on network thread
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.maybeLogElapsedTime(MockNioTransport.java:379) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.unregister(MockNioTransport.java:371) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:154) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:174) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:832) [?:?]
  1> [2020-02-10T09:18:23,387][WARN ][o.e.t.n.MockNioTransport ] [node_sd2] Slow execution on network thread [102 milliseconds]
  1> java.lang.RuntimeException: Slow exception on network thread
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.maybeLogElapsedTime(MockNioTransport.java:379) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.unregister(MockNioTransport.java:371) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:154) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:174) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:832) [?:?]
  1> [2020-02-10T09:18:23,388][WARN ][o.e.t.n.MockNioTransport ] [node_sd1] Slow execution on network thread [100 milliseconds]
  1> java.lang.RuntimeException: Slow exception on network thread
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.maybeLogElapsedTime(MockNioTransport.java:379) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.MockNioTransport$TransportThreadWatchdog.unregister(MockNioTransport.java:371) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:154) [framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:174) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) [elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:832) [?:?]

and then the test doesn't really fail on any snapshot operation but simply fails to replicate just restored in indices in time (looking at the hot threads dump, it's still actively recovering when the green assertion fails) => nothing we can do here, this seems to be an infra issue.

@williamrandolph
Copy link
Contributor

I don't believe we should reopen this issue, but I want to note for the record that this still seems to be happening: https://gradle-enterprise.elastic.co/s/3mj7lhs3dztds

@original-brownbear
Copy link
Member

Annoyingly enough all the failures listed here have WindowsFS in common. See #52000 ... I wonder if snapshotting as an action somehow stresses WindowsFS more that much more than other tests that it repeatedly triggers this slowness. I'll look into that if it comes up too often.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants