Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Getting DeadlineTimeoutException error during security tests after upgrading to 3.0 #580

Closed
qreshi opened this issue Dec 14, 2022 · 16 comments · Fixed by #583
Closed
Labels
bug Something isn't working untriaged v3.0.0

Comments

@qreshi
Copy link
Contributor

qreshi commented Dec 14, 2022

What is the bug?
With the upgrade to 3.0 in #556, the dependency on org.apache.httpcomponents:httpclient and org.apache.httpcomponents:httpcore changed to org.apache.httpcomponents.client5:httpclient5 and org.apache.httpcomponents.core5:httpcore5 respectively (these are coming from OpenSearch core).

After this upgrade we're starting to see some of the tests fail during the security test CI workflow with the following error:

REPRODUCE WITH: ./gradlew ':notifications:integTest' --tests "org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission" -Dtests.seed=C977BA49C8625070 -Dtests.security.manager=false -Dtests.locale=be -Dtests.timezone=Asia/Yakutsk -Druntime.java=17

org.opensearch.integtest.SecurityNotificationIT > test get slack notification config without get Notification permission FAILED
    java.lang.RuntimeException: error while performing request
        at __randomizedtesting.SeedInfo.seed([C977BA49C8625070:44430BFE7A23F02]:0)
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:1045)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:341)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:329)
        at org.opensearch.integtest.PluginRestTestCase.createUserRolesMapping(PluginRestTestCase.kt:185)
        at org.opensearch.integtest.PluginRestTestCase.createUserWithCustomRole(PluginRestTestCase.kt:256)
        at org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission(SecurityNotificationIT.kt:304)

        Caused by:
        org.apache.hc.core5.util.DeadlineTimeoutException: Deadline: 292278994-08-17T16:12:55.807+0900, 9223370365877384600 MILLISECONDS overdue
            at app//org.apache.hc.core5.util.DeadlineTimeoutException.from(DeadlineTimeoutException.java:49)
            at app//org.apache.hc.core5.pool.StrictConnPool.lease(StrictConnPool.java:217)
            at app//org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$1.<init>(PoolingAsyncClientConnectionManager.java:231)
            at app//org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager.lease(PoolingAsyncClientConnectionManager.java:227)
            at app//org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.acquireEndpoint(InternalHttpAsyncExecRuntime.java:100)
            at app//org.apache.hc.client5.http.impl.async.AsyncConnectExec.execute(AsyncConnectExec.java:135)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement$1.proceed(AsyncExecChainElement.java:62)
            at app//org.apache.hc.client5.http.impl.async.AsyncProtocolExec.internalExecute(AsyncProtocolExec.java:198)
            at app//org.apache.hc.client5.http.impl.async.AsyncProtocolExec.execute(AsyncProtocolExec.java:158)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement$1.proceed(AsyncExecChainElement.java:62)
            at app//org.apache.hc.client5.http.impl.async.AsyncRedirectExec.internalExecute(AsyncRedirectExec.java:112)
            at app//org.apache.hc.client5.http.impl.async.AsyncRedirectExec.execute(AsyncRedirectExec.java:278)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient.executeImmediate(InternalAbstractHttpAsyncClient.java:368)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$1.sendRequest(InternalAbstractHttpAsyncClient.java:224)
            at app//org.apache.hc.core5.http.nio.support.BasicRequestProducer.sendRequest(BasicRequestProducer.java:93)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient.doExecute(InternalAbstractHttpAsyncClient.java:180)
            at app//org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:97)
            at app//org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:107)
            at app//org.opensearch.client.RestClient.performRequest(RestClient.java:337)
            ... 4 more

Not all of the tests failed in this example workflow that is being referenced:

75 tests completed, 6 failed
Tests with failures:
 - org.opensearch.integtest.SecurityNotificationIT.test delete slack notification config with user that has get Notification permission
 - org.opensearch.integtest.SecurityNotificationIT.test getChannelList should return only channels with get channel permission
 - org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission
 - org.opensearch.integtest.SecurityNotificationIT.test get slack notification config with user that has get Notification permission
 - org.opensearch.integtest.config.DeleteNotificationConfigIT.test Delete single absent notification config should fail
 - org.opensearch.integtest.config.EmailNotificationConfigCrudIT.test Create email notification config without email_group IDs

Given the stacktrace, it seems that the client used to make the calls to the cluster during the integration tests needed to obtain a lease from a connection pool and that lease either took too long or could not be obtained. We should check with the OpenSearch core team if the changes to these httpcomponents related dependencies has changed anything with the connection pool resourcing/timeouts leading to these occasional errors. Strangely enough, this error was only seen during the security test runs in the CI. The regular integration tests did not run into this problem. Either it's flaky or the conditions of the Docker security tests make it more likely to happen.

How can one reproduce the bug?
Steps to reproduce the behavior:
Run the security integration test CI in a PR or locally run the security tests against a Docker image running OpenSearch 3.0 with security

What is the expected behavior?
The client is able to get the lease in time and perform the operation during the tests.

@qreshi qreshi added bug Something isn't working untriaged labels Dec 14, 2022
@saratvemulapalli
Copy link
Member

saratvemulapalli commented Dec 14, 2022

Is this the first run of tests in 3.0?
We have updated Apache HTTP Client in 3.0[1]. This could be related.

[1] opensearch-project/OpenSearch#4459

@qreshi
Copy link
Contributor Author

qreshi commented Dec 14, 2022

@saratvemulapalli The referenced PR is the first run of the tests in 3.0, yes. The PR itself has been open for some time though and we had seen the errors earlier after we bumped.

Do you know if there is known root cause for how that bump has affected the client pool in this way? Or if there is a workaround for getting rid of these test errors?

These errors are the only things blocking us from merging this PR in and we'd like to get main updated as soon as possible to unblock dependent plugins as well.

@qreshi
Copy link
Contributor Author

qreshi commented Dec 14, 2022

@dblock I saw you had reviewed the OpenSearch PR to migrate the client transports. Any thoughts?

@saratvemulapalli
Copy link
Member

saratvemulapalli commented Dec 15, 2022

@reta do you have any suggestions?
(I haven't dug through the code but if this is still blocked later this week I will have some bandwidth).

@reta
Copy link
Contributor

reta commented Dec 15, 2022

@saratvemulapalli @qreshi interesting, I will take a look shortly, but for cause so far for such kind of exceptions was the issues with plugin itself (fe it was not installed properly), hence the timeout.

@reta
Copy link
Contributor

reta commented Dec 15, 2022

Ok, @qreshi I cannot reproduce the issue right now, here is what could have happened: the OpenSearch or / and plugin did not start properly in the Docker container, causing all tests to fail.

@qreshi
Copy link
Contributor Author

qreshi commented Dec 15, 2022

Hey @reta thanks for looking into it. I'm not sure that's the cause because not all the tests failed. If you look at the summary of the failures in the issue description as well, only some of the tests failed.

Even within the SecurityNotificationIT test class, not all of the tests failed. That's what made me think it was due to some resource starvation in the connection pool. Strangely enough, we only seem to be seeing this in our security/Docker workflow test. I haven't seen this error yet in our regular integration test run.

@reta
Copy link
Contributor

reta commented Dec 15, 2022

Hey @reta thanks for looking into it. I'm not sure that's the cause because not all the tests failed. If you look at the summary of the failures in the issue description as well, only some of the tests failed.

Oh, yes, you are right, I looked at JDK-17 first (it has more failures), but JDK-11 has only one test case

@qreshi
Copy link
Contributor Author

qreshi commented Dec 15, 2022

Hey @reta thanks for looking into it. I'm not sure that's the cause because not all the tests failed. If you look at the summary of the failures in the issue description as well, only some of the tests failed.

Oh, yes, you are right, I looked at JDK-17 first (it has more failures), but JDK-11 has only one test case

It's definitely flaky. I re-ran the security workflow for both and this time JDK-17 only had one test failure and JDK-11 had a different set fail.

Is it possible to find out what the connection pool size and lease timeout is that OpenSearch uses? If so, we can see if these values changed.

Also, if it's possible to override those values, we can try making them more lenient and see if the failures go away. Just to rule out causes.

@reta
Copy link
Contributor

reta commented Dec 15, 2022

@qreshi Ok, so one problem is this guy opensearch-project/common-utils#287. Regarding timeouts, SecureRestClientBuilder has methods to set all all of them, for SecurityNotificationIT sets setSocketTimeout:

userClient = SecureRestClientBuilder(clusterHosts.toTypedArray(), isHttps(), user, user).setSocketTimeout(60000).build()

Besides that, there is connectionRequestTimeout and connectTimeout

@reta
Copy link
Contributor

reta commented Dec 15, 2022

@qreshi there is something very fishy here, the DeadlineTimeoutException does not make sense

org.apache.hc.core5.util.DeadlineTimeoutException: Deadline: 292278994-08-17T03:12:55.807-0400, 9223370365715786834 MILLISECONDS overdue

292278994-08-17T03:12:55.807-0400 this is supposed to be timestamp, but it has year component set to 292278994, I will continue looking tomorrow.

@qreshi
Copy link
Contributor Author

qreshi commented Dec 15, 2022

@qreshi there is something very fishy here, the DeadlineTimeoutException does not make sense

org.apache.hc.core5.util.DeadlineTimeoutException: Deadline: 292278994-08-17T03:12:55.807-0400, 9223370365715786834 MILLISECONDS overdue

292278994-08-17T03:12:55.807-0400 this is supposed to be timestamp, but it has year component set to 292278994, I will continue looking tomorrow.

That's a good catch. That millisecond overdue time is ridiculously large as a result.

@reta
Copy link
Contributor

reta commented Dec 16, 2022

@qreshi my apologies, I didn't get time to resume looking into the issue today but I will continue next week

@reta
Copy link
Contributor

reta commented Dec 20, 2022

@qreshi the issue is understood and fixed by #583: the test failures were caused by incorrect SecureRestClientBuilder's defaults related to connectionRequestTimeout: the value was set t o0 which basically meant for strict connection pool to fail immediately if the connection is not available.

@qreshi
Copy link
Contributor Author

qreshi commented Jan 3, 2023

Nice, thanks for looking into this @reta.

So I'm assuming something changed in the newer version of the underlying library that made the strict connection pool not tolerant enough to pass without a larger timeout?

@reta
Copy link
Contributor

reta commented Jan 3, 2023

So I'm assuming something changed in the newer version of the underlying library that made the strict connection pool not tolerant enough to pass without a larger timeout?

The are many changes indeed, this is major version bump

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged v3.0.0
Projects
None yet
3 participants