ReadTimeoutException loading JWK causing exponential resource consumption since 4.5.0 #1633
Comments
Thanks for reporting @etremblay. You can disable the use of the Micronaut HTTP Client with: micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false
@etremblay can you try disabling the use of the Micronaut HTTP Client for JWKS and report back?
@jeremyg484 make sure we document that setting.
@jeremyg484 we need to add tests for retries for the Micronaut Framework HTTP Client.
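The property mentioned above can be set in application configuration. A minimal sketch of what that might look like in YAML form (the property key is taken verbatim from the comment above; the file layout follows the usual Micronaut conventions):

```yaml
# application.yml - disable the Micronaut HTTP client for JWKS fetching,
# falling back to the previous (non-Micronaut-client) fetching mechanism
micronaut:
  security:
    token:
      jwt:
        signatures:
          jwks-client:
            http-client:
              enabled: false
```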
This typically occurs when the server and client share the same event loop and there are not enough threads in the pod for both, so the event loop deadlocks if you perform a blocking operation. //cc @yawkat
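The deadlock described above can be reproduced in miniature with a single-threaded executor standing in for the shared Netty event loop. This is an illustrative sketch, not Micronaut code; all names are hypothetical:

```java
import java.util.concurrent.*;

public class EventLoopDeadlock {
    public static void main(String[] args) throws Exception {
        // One thread playing the role of the shared event loop
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();

        // An inbound request handler that blocks while waiting on an
        // outbound call (think: the JWKS fetch) scheduled on the SAME loop
        Future<String> request = eventLoop.submit(() -> {
            Future<String> jwksFetch = eventLoop.submit(() -> "{\"keys\":[]}");
            // The only loop thread is busy right here, so the fetch can
            // never run: this get() blocks until the timeout fires
            return jwksFetch.get(200, TimeUnit.MILLISECONDS);
        });

        try {
            request.get();
        } catch (ExecutionException e) {
            // The classic same-event-loop deadlock, surfaced as a timeout
            System.out.println("deadlocked: " + e.getCause().getClass().getSimpleName());
        } finally {
            eventLoop.shutdownNow();
        }
    }
}
```

With a real event loop and no timeout, the blocked thread never recovers, which matches the "application never recovers" symptom reported below.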
Hi @sdelamo, thanks for the quick reply. We almost tried disabling the HTTP client as you say; I'm not sure if it is documented, but we found it by reading the source code. Ultimately we opted for the static JWKS file, which seemed the safer option. We could not afford another production downtime affecting thousands of users.
@etremblay Just to clarify, setting …
@jeremyg484 From what I've seen, the original implementation was not called inside a reactive … Even with the fallback to the old implementation, it is still called inside a …
@etremblay I think I have narrowed down the issue, and it is likely a deadlock, as @graemerocher mentioned. I believe I know the cause and I have a couple of ideas of how to resolve it, but I would like to get a little more info about your particular setup so that I can be sure the changes I have in mind will fix this particular problem. Can you provide more details about how you are configuring JWT and JWKS? I would like to understand what exactly about your configuration and/or code causes the JWK set to be fetched at startup time of your application. How are the retries happening? Do you have the HttpClient (default, or service-specific and intended to be used for the JWK set fetching) configured for retries, or are you using some other means?
@jeremyg484 The hint about the event loop and threads made my team and me rethink the problem from a different angle. First of all, I was misled when I thought the problem occurred during application boot. In fact, it occurs when the application receives its first HTTP request protected by Micronaut Security.

When the app is under high pressure, k8s scales up and spawns a new pod (or more). When the application healthcheck is ready, the cluster can redirect many requests at once. The health endpoint itself is not protected by micronaut-security and does not initialize the JWKS. We then have a situation with many inbound requests waiting for authentication and one outbound request for the JWKS, probably waiting for the event loop.

Moreover, the configured CPU request for the pod is 500m, so the node may allocate less than one CPU to the application. Not sure if that would limit the number of threads available to the application, but there is a strong smell there. We intend to try to reproduce the issue in a non-production environment and see if a higher CPU request fixes it. I will keep you informed.
@etremblay That is very useful info, thank you. There is definitely a place in the new JWK set fetching code where we are blocking when we shouldn't be, and I am going to fix that. With the allocation of less than one CPU, it makes sense to me that you would be more likely to encounter the deadlock. Allocating more resources might get you past the problem with the current version, but until the fix I have in mind is in place, there will still be a chance that it could occur, just at a later point. We will get that sorted out shortly.
JwtTokenValidator is modified to explicitly execute JwtValidator.validate in a separate thread from the Netty event loop. JwtValidator.validate is blocking in its current implementation and this allows it to be executed in a non-blocking fashion. This is a short-term fix for #1633, where the preferred long-term fix would be to refactor the JwtValidator API to be non-blocking.
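The fix described in the commit message amounts to hopping off the event loop before invoking the blocking validator. A minimal sketch of that pattern using plain java.util.concurrent (validateBlocking is a hypothetical stand-in for JwtValidator.validate, not the actual Micronaut API):

```java
import java.util.concurrent.*;

public class OffloadValidation {
    // Hypothetical stand-in for the blocking JwtValidator.validate call
    static String validateBlocking(String token) {
        return token.isEmpty() ? "invalid" : "valid";
    }

    public static void main(String[] args) {
        // Dedicated pool for blocking work, separate from the event loop
        ExecutorService blockingPool = Executors.newCachedThreadPool();
        try {
            // The event-loop thread only schedules the work and moves on;
            // the blocking validation runs on a pool thread instead
            CompletableFuture<String> result =
                    CompletableFuture.supplyAsync(() -> validateBlocking("eyJ..."), blockingPool);
            System.out.println(result.join());
        } finally {
            blockingPool.shutdown();
        }
    }
}
```

Keeping blocking work on a dedicated pool means the event loop stays free to serve both the inbound requests and the outbound JWKS fetch, which removes the deadlock condition even when few threads are available.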
Thanks!
Expected Behavior
When the JWKS endpoint is available, the application loads the key information and boots normally.
When the JWKS endpoint is down, the application retries until the JWKS endpoint is up.
Actual Behaviour
The application fails with a ReadTimeoutException while loading the JWK set.
We observed that after each attempt, CPU and memory consumption rise very quickly.
Giving more resources to the pod does not resolve the problem.
The application never recovers from this.
Steps To Reproduce
Unfortunately, we could reproduce this only in our production environment.
The authorization application that serves the JWKS shows no problem, and the endpoint responds in 1-2 ms.
This only occurs during the application's initial boot.
We started having the problem with micronaut-security 4.5.0 (Micronaut 4.3.1). We think it's related to #1509, probably due to the change to a reactive implementation or the switch to the Micronaut HTTP client.
The problem is still present in 4.6.6.
We could work around the problem with a static JWKS configuration.
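For reference, the static-file workaround might look roughly like the fragment below. The `jwks-static` property layout and the classpath location are assumptions from memory and should be verified against the Micronaut Security configuration reference:

```yaml
# application.yml - serve signature keys from a bundled JWKS file instead
# of fetching them over HTTP (assumed property names; verify against docs)
micronaut:
  security:
    token:
      jwt:
        signatures:
          jwks-static:
            mykeys:
              path: 'classpath:jwks.json'
```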
We ruled out the migration to Micronaut 4.x as the cause, since the application had been running on Micronaut 4.2.4 without problems.
Environment Information
Ubuntu 20.04 (docker image ubuntu:latest)
java temurin-17.0.10+7
The application runs on k8s.
The problem affects many pods at the same time.
The JWKS application is reached internally via a Kubernetes service.
Example Application
No response
Version
Since micronaut-security 4.5.0 (micronaut 4.3.1)