
ReadTimeoutException loading JWK causing exponential resource consumption since 4.5.0 #1633

Closed
etremblay opened this issue Mar 13, 2024 · 9 comments
Labels
type: bug Something isn't working

@etremblay

Expected Behavior

When the JWKS endpoint is available, the application loads the key information and boots normally.

When the JWKS endpoint is down, the application retries until the JWKS endpoint is up.

Actual Behaviour

The application fails with

Exception loading JWK
io.micronaut.http.client.exceptions.ReadTimeoutException: Read Timeout
	at io.micronaut.http.client.exceptions.ReadTimeoutException.<clinit>(ReadTimeoutException.java:26)
	at io.micronaut.http.client.netty.DefaultHttpClient.lambda$exchangeImpl$28(DefaultHttpClient.java:1156)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
	at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:296)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:281)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:420)
	at reactor.core.publisher.FluxOnErrorReturn$ReturnSubscriber.onNext(FluxOnErrorReturn.java:162)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
	at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

We observed that after each attempt, CPU and memory consumption rise very quickly.

Giving more resources to the pod does not resolve the problem.

The application never recovers from this.

Steps To Reproduce

Unfortunately, we could reproduce this only in our production environment.

The authorization application that serves the JWKS shows no problem, and the endpoint responds in 1-2 ms.

This only occurs during the application's initial boot.

We started seeing the problem with micronaut-security 4.5.0 (Micronaut 4.3.1). We think it's related to #1509.

It is probably due to the change to a reactive implementation or to the switch to the Micronaut HTTP client.

The problem is still present in 4.6.6.

We could work around the problem with a static JWKS configuration.

We ruled out the migration to Micronaut 4.x as the cause, since the application had been running on Micronaut 4.2.4 without problems.

Environment Information

Ubuntu 20.04 (docker image ubuntu:latest)
java temurin-17.0.10+7

The application runs on k8s.
The problem affects many pods at the same time.
The JWKS application is reached internally via a Kubernetes service.

Example Application

No response

Version

Since micronaut-security 4.5.0 (micronaut 4.3.1)

@sdelamo
Contributor

sdelamo commented Mar 14, 2024

Thanks for reporting, @etremblay. You can disable the use of the Micronaut HTTP Client with:

micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false

@etremblay can you try disabling the use of Micronaut HTTP Client for JWKS and report back?

@jeremyg484 make sure we document that setting. @jeremyg484 we need to add tests for retries for the Micronaut Framework HTTP Client.

@sdelamo sdelamo added the type: bug Something isn't working label Mar 14, 2024
@graemerocher
Contributor

This typically occurs when the server and the client share the same event loop and there are not enough threads in the pod for both, so the event loop deadlocks if you perform a blocking operation. //cc @yawkat
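Roughly, the failure mode looks like the following Reactor sketch (illustrative only, not Micronaut's actual code; a single-threaded executor stands in for the Netty event loop, and the names and timeouts are invented). A handler running on the "event loop" blocks waiting for a value whose production is queued on that very same thread, so the blocking read can only time out:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class EventLoopBlockingSketch {

    public static void main(String[] args) throws InterruptedException {
        // A single thread standing in for a Netty event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        Scheduler loop = Schedulers.fromExecutorService(eventLoop);

        // Simulated JWKS fetch: the value is produced on the "event loop" thread.
        Mono<String> jwksResponse = Mono.fromCallable(() -> "{\"keys\":[]}")
                .subscribeOn(loop);

        // A request handler running on that same thread blocks for the result.
        // The only thread that could produce the value is the one being parked,
        // so block(...) never completes and times out.
        eventLoop.submit(() -> {
            try {
                System.out.println("got " + jwksResponse.block(Duration.ofSeconds(2)));
            } catch (RuntimeException e) {
                System.out.println("deadlocked: " + e.getMessage());
            }
        });

        Thread.sleep(3000); // let the parked handler hit its timeout
        loop.dispose();
        eventLoop.shutdownNow();
    }
}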

@etremblay
Author

etremblay commented Mar 14, 2024

Thanks for reporting, @etremblay. You can disable the use of the Micronaut HTTP Client with:

micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false

@etremblay can you try disabling the use of Micronaut HTTP Client for JWKS and report back?

@jeremyg484 make sure we document that setting. @jeremyg484 we need to add tests for retries for the Micronaut Framework HTTP Client.

Hi @sdelamo, thanks for the quick reply.

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.

Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@jeremyg484
Contributor

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.

Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@etremblay Just to clarify, setting micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false will cause the module to fall back to the default implementation provided by the JWT library that we use (https://connect2id.com/products/nimbus-jose-jwt). That implementation uses java.net.* classes, and is what was used prior to the addition of the Micronaut HttpClient-based implementation. Thus it should be the exact same behavior as in versions prior to 4.5.0. I will make sure this point is added to the documentation while working on the fix to the overall issue.
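For reference, that fallback path boils down to something like the following plain java.net fetch (a rough sketch; the URL, the timeouts and the size limit here are placeholders, not the module's defaults):

import java.net.URL;

import com.nimbusds.jose.jwk.JWKSet;

public class NimbusJwksFetchSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder JWKS endpoint; in Micronaut this comes from the signature configuration.
        URL jwksUrl = new URL("https://auth.example.com/keys");

        // Nimbus retrieves the document over a plain java.net HTTP connection
        // (no Netty event loop involved), with connect/read timeouts in ms
        // and a response size limit in bytes.
        JWKSet jwkSet = JWKSet.load(jwksUrl, 2_000, 2_000, 51_200);

        jwkSet.getKeys().forEach(jwk ->
                System.out.println(jwk.getKeyType() + " key, kid=" + jwk.getKeyID()));
    }
}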

@etremblay
Author

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.
Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@etremblay Just to clarify, setting micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false will cause the module to fall back to the default implementation provided by the JWT library that we use (https://connect2id.com/products/nimbus-jose-jwt). That implementation uses java.net.* classes, and is what was used prior to the addition of the Micronaut HttpClient-based implementation. Thus it should be the exact same behavior as in versions prior to 4.5.0. I will make sure this point is added to the documentation while working on the fix to the overall issue.

@jeremyg484 From what I've seen, the original implementation was not called inside a reactive Mono; we feared the problem could be caused by the switch to a reactive implementation rather than by the switch to the Micronaut HTTP client.

Even with the fallback to the old implementation, it is still called inside a Mono.

@jeremyg484
Contributor

@etremblay I think I have narrowed down the issue, and it is likely a deadlock issue as @graemerocher mentioned. I believe I know the cause and I have a couple of ideas of how to resolve it. But I would like to get a little more info about your particular setup so that I can be sure the changes I have in mind will fix this particular problem.

Can you provide more details about how you are configuring JWT and JWKS? I would like to understand what exactly about your configuration and/or code is causing the JWK set to be fetched at startup time of your application.

How are the retries happening? Do you have the HttpClient (the default one, or a service-specific one intended to be used for the JWK set fetching) configured for retries, or are you using some other means?

@etremblay
Author

@jeremyg484 The hint about the event loop and threads made my team and me rethink the problem from a different angle.

First of all, I was misled when I thought the problem occurred during the application boot. In fact, it occurs when the application receives its first HTTP request protected by Micronaut Security.

When the app is under high load, k8s scales up and spawns a new pod (or more). When the application health check reports ready, the cluster can redirect many requests to it at once. The health endpoint itself is not protected by micronaut-security and does not initialize the JWKS.

We then have a situation with many inbound requests waiting for authentication and one outbound request for the JWKS, probably waiting for the event loop.

Moreover, the configured CPU request for the pod is 500m, so the node may allocate less than one CPU to the application. I'm not sure whether that would limit the number of threads available to the application, but there is a strong smell there.

We intend to try to reproduce the issue in a non-production environment and see if a higher CPU request fixes it. I will keep you informed.
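As a quick sanity check, something like the following could show what the JVM in such a pod actually sees, since Netty typically sizes its default event loop from the available processors (the 2x factor below is an assumption about Netty's default, and frameworks may override it):

public class EventLoopSizeCheck {

    public static void main(String[] args) {
        // Inside a container the JVM is cgroup-aware, so depending on the JDK version
        // and on the CPU requests/limits this can report 1 for a 500m request.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + cpus);

        // Assumption: Netty's default event-loop size is 2 * availableProcessors,
        // unless io.netty.eventLoopThreads (or the framework) overrides it.
        System.out.println("likely default event-loop threads = " + Math.max(1, cpus * 2));
    }
}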

@jeremyg484
Copy link
Contributor

@etremblay That is very useful info, thank you. There is definitely a place in the new JWK set fetching code where we are blocking when we shouldn't be, and I am going to fix that. With the allocation of less than one CPU, it makes sense to me that you would be more likely to encounter the deadlock. Allocating more resources might get you past the problem with the current version, but until the fix I have in mind is in place, there will still be a chance that it could occur, just at a later point - we will get that sorted out shortly.

jeremyg484 added a commit that referenced this issue Mar 15, 2024
JwtTokenValidator is modified to explicitly execute
JwtValidator.validate in a separate thread from the Netty event loop.

JwtValidator.validate is blocking in its current implementation and
this allows it to be executed in a non-blocking fashion.

This is a short-term fix for #1633, where the preferred long-term fix
would be to refactor the JwtValidator API to be non-blocking.
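In outline, the change amounts to the following pattern, sketched outside the project's code (illustrative names, not the actual JwtTokenValidator implementation): wrap the blocking call in a Mono and subscribe it on a scheduler meant for blocking work, so the event loop only ever sees a non-blocking publisher.

import java.util.Optional;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class OffloadBlockingValidationSketch {

    // Stand-in for a blocking validation call such as JwtValidator.validate.
    static Optional<String> validateBlocking(String token) {
        return token.isBlank() ? Optional.empty() : Optional.of("authenticated:" + token);
    }

    // The blocking work runs on boundedElastic, a scheduler intended for blocking tasks,
    // instead of on the Netty event loop that subscribes to the result.
    static Mono<String> validate(String token) {
        return Mono.fromCallable(() -> validateBlocking(token))
                .subscribeOn(Schedulers.boundedElastic())
                .flatMap(result -> result.map(Mono::just).orElseGet(Mono::empty));
    }

    public static void main(String[] args) {
        System.out.println(validate("some.jwt.token").block());
    }
}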
sdelamo pushed a commit that referenced this issue Mar 18, 2024 (same commit message as above)
sdelamo pushed a commit that referenced this issue Mar 18, 2024 (same commit message as above)
@sdelamo sdelamo closed this as completed Mar 18, 2024
@etremblay
Author

Thanks!
