
ReadTimeoutException loading JWK causing exponential resource consumption since 4.5.0 #1633

Closed
etremblay opened this issue Mar 13, 2024 · 9 comments
Labels
type: bug Something isn't working

@etremblay

Expected Behavior

When the JWKS endpoint is available, the application loads the key information and boots normally.

When the JWKS endpoint is down, the application retries until the JWKS endpoint is up.

Actual Behaviour

The application fails with

Exception loading JWK
io.micronaut.http.client.exceptions.ReadTimeoutException: Read Timeout
	at io.micronaut.http.client.exceptions.ReadTimeoutException.<clinit>(ReadTimeoutException.java:26)
	at io.micronaut.http.client.netty.DefaultHttpClient.lambda$exchangeImpl$28(DefaultHttpClient.java:1156)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
	at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:296)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:281)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:420)
	at reactor.core.publisher.FluxOnErrorReturn$ReturnSubscriber.onNext(FluxOnErrorReturn.java:162)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
	at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

We observed that after each attempt, CPU and memory consumption rise very quickly.

Giving more resources to the pod does not resolve the problem.

The application never recovers from this.

Steps To Reproduce

Unfortunately, we could reproduce this only in our production environment.

The authorization application that serves the JWKS shows no problem, and the endpoint responds in 1-2 ms.

This only occurs during the application's initial boot.

We started seeing the problem with micronaut-security 4.5.0 (Micronaut 4.3.1). We think it's related to #1509.

It is probably due to the change to a reactive implementation or to the switch to the Micronaut HTTP client.

The problem is still present in 4.6.6.

We could work around the problem with a static JWKS configuration.

We ruled out the migration to Micronaut 4.x as the cause, since the application had been running on Micronaut 4.2.4 without problems.

Environment Information

Ubuntu 20.04 (docker image ubuntu:latest)
java temurin-17.0.10+7

The application runs on k8s.
The problem affects many pods at the same time.
The JWKS application is reached internally via a Kubernetes service.

Example Application

No response

Version

Since micronaut-security 4.5.0 (micronaut 4.3.1)

@sdelamo
Contributor

sdelamo commented Mar 14, 2024

Thanks for reporting, @etremblay. You can disable the use of the Micronaut HTTP Client with:

micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false

@etremblay can you try disabling the use of Micronaut HTTP Client for JWKS and report back?

@jeremyg484 make sure we document that setting. @jeremyg484 we need to add tests for retries for the Micronaut Framework HTTP Client.

@sdelamo sdelamo added the type: bug Something isn't working label Mar 14, 2024
@graemerocher
Contributor

This typically occurs when the server and the client share the same event loop and there are not enough threads in the pod for both, so the event loop deadlocks if you perform a blocking operation. //cc @yawkat
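Roughly, the failure mode looks like the following Reactor sketch (illustrative only, not Micronaut's actual code; a single-threaded executor stands in for the Netty event loop, and the names and timeouts are invented). A handler running on the "event loop" blocks waiting for a value whose production is queued on that very same thread, so the blocking read can only time out:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class EventLoopBlockingSketch {

    public static void main(String[] args) throws InterruptedException {
        // A single thread standing in for a Netty event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        Scheduler loop = Schedulers.fromExecutorService(eventLoop);

        // Simulated JWKS fetch: the value is produced on the "event loop" thread.
        Mono<String> jwksResponse = Mono.fromCallable(() -> "{\"keys\":[]}")
                .subscribeOn(loop);

        // A request handler running on that same thread blocks for the result.
        // The only thread that could produce the value is the one being parked,
        // so block(...) never completes and times out.
        eventLoop.submit(() -> {
            try {
                System.out.println("got " + jwksResponse.block(Duration.ofSeconds(2)));
            } catch (RuntimeException e) {
                System.out.println("deadlocked: " + e.getMessage());
            }
        });

        Thread.sleep(3000); // let the parked handler hit its timeout
        loop.dispose();
        eventLoop.shutdownNow();
    }
}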

@etremblay
Author

etremblay commented Mar 14, 2024

Thanks for reporting, @etremblay. You can disable the use of the Micronaut HTTP Client with:

micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false

@etremblay can you try disabling the use of Micronaut HTTP Client for JWKS and report back?

@jeremyg484 make sure we document that setting. @jeremyg484 we need to add tests for retries for the Micronaut Framework HTTP Client.

Hi @sdelamo, thanks for the quick reply.

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.

Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@jeremyg484
Contributor

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.

Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@etremblay Just to clarify, setting micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false will cause the module to fall back to the default implementation provided by the JWT library that we use (https://connect2id.com/products/nimbus-jose-jwt). That implementation uses java.net.* classes, and is what was used prior to the addition of the Micronaut HttpClient-based implementation. Thus it should be the exact same behavior as in versions prior to 4.5.0. I will make sure this point is added to the documentation while working on the fix to the overall issue.
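For reference, that fallback path boils down to something like the following plain java.net fetch (a rough sketch; the URL, the timeouts and the size limit here are placeholders, not the module's defaults):

import java.net.URL;

import com.nimbusds.jose.jwk.JWKSet;

public class NimbusJwksFetchSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder JWKS endpoint; in Micronaut this comes from the signature configuration.
        URL jwksUrl = new URL("https://auth.example.com/keys");

        // Nimbus retrieves the document over a plain java.net HTTP connection
        // (no Netty event loop involved), with connect/read timeouts in ms
        // and a response size limit in bytes.
        JWKSet jwkSet = JWKSet.load(jwksUrl, 2_000, 2_000, 51_200);

        jwkSet.getKeys().forEach(jwk ->
                System.out.println(jwk.getKeyType() + " key, kid=" + jwk.getKeyID()));
    }
}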

@etremblay
Author

We almost tried disabling the HTTP client as you suggest; I'm not sure if it is documented, but we found it by reading the source code.
Unfortunately, we opted for the static JWKS file, which seemed the safer option. We could not afford another production outage affecting thousands of users.

@etremblay Just to clarify, setting micronaut.security.token.jwt.signatures.jwks-client.http-client.enabled=false will cause the module to fall back to the default implementation provided by the JWT library that we use (https://connect2id.com/products/nimbus-jose-jwt). That implementation uses java.net.* classes, and is what was used prior to the addition of the Micronaut HttpClient-based implementation. Thus it should be the exact same behavior as in versions prior to 4.5.0. I will make sure this point is added to the documentation while working on the fix to the overall issue.

@jeremyg484 From what I've seen, the original implementation was not called inside a reactive Mono; we feared the problem could be caused by the switch to a reactive implementation rather than by the switch to the Micronaut HTTP client.

Even with the fallback to the old implementation, it is still called inside a Mono.

@jeremyg484
Contributor

@etremblay I think I have narrowed down the issue, and it is likely a deadlock issue as @graemerocher mentioned. I believe I know the cause and I have a couple of ideas of how to resolve it. But I would like to get a little more info about your particular setup so that I can be sure the changes I have in mind will fix this particular problem.

Can you provide more details about how you are configuring JWT and JWKS? I would like to understand what exactly about your configuration and/or code is causing the JWK set to be fetched at startup time of your application.

How are the retries happening? Do you have the HttpClient (the default one, or a service-specific one intended to be used for the JWK set fetching) configured for retries, or are you using some other means?

@etremblay
Author

@jeremyg484 The hint about the event loop and threads made my team and me rethink the problem from a different angle.

First of all, I was misled when I thought the problem occurred during the application boot. In fact, it occurs when the application receives its first HTTP request protected by Micronaut Security.

When the app is under high load, k8s scales up and spawns a new pod (or more). When the application health check reports ready, the cluster can redirect many requests to it at once. The health endpoint itself is not protected by micronaut-security and does not initialize the JWKS.

We then have a situation with many inbound requests waiting for authentication and one outbound request for the JWKS, probably waiting for the event loop.

Moreover, the configured CPU request for the pod is 500m, so the node may allocate less than one CPU to the application. I'm not sure whether that would limit the number of threads available to the application, but there is a strong smell there.

We intend to try to reproduce the issue in a non-production environment and see if a higher CPU request fixes it. I will keep you informed.
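As a quick sanity check, something like the following could show what the JVM in such a pod actually sees, since Netty typically sizes its default event loop from the available processors (the 2x factor below is an assumption about Netty's default, and frameworks may override it):

public class EventLoopSizeCheck {

    public static void main(String[] args) {
        // Inside a container the JVM is cgroup-aware, so depending on the JDK version
        // and on the CPU requests/limits this can report 1 for a 500m request.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + cpus);

        // Assumption: Netty's default event-loop size is 2 * availableProcessors,
        // unless io.netty.eventLoopThreads (or the framework) overrides it.
        System.out.println("likely default event-loop threads = " + Math.max(1, cpus * 2));
    }
}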

@jeremyg484
Copy link
Contributor

@etremblay That is very useful info, thank you. There is definitely a place in the new JWK set fetching code where we are blocking when we shouldn't be, and I am going to fix that. With the allocation of less than one CPU, it makes sense to me that you would be more likely to encounter the deadlock. Allocating more resources might get you past the problem with the current version, but until the fix I have in mind is in place, there will still be a chance that it could occur, just at a later point - we will get that sorted out shortly.

jeremyg484 added a commit that referenced this issue Mar 15, 2024
JwtTokenValidator is modified to explicitly execute
JwtValidator.validate in a separate thread from the Netty event loop.

JwtValidator.validate is blocking in its current implementation and
this allows it to be executed in a non-blocking fashion.

This is a short-term fix for #1633, where the preferred long-term fix
would be to refactor the JwtValidator API to be non-blocking.
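In outline, the change amounts to the following pattern, sketched outside the project's code (illustrative names, not the actual JwtTokenValidator implementation): wrap the blocking call in a Mono and subscribe it on a scheduler meant for blocking work, so the event loop only ever sees a non-blocking publisher.

import java.util.Optional;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class OffloadBlockingValidationSketch {

    // Stand-in for a blocking validation call such as JwtValidator.validate.
    static Optional<String> validateBlocking(String token) {
        return token.isBlank() ? Optional.empty() : Optional.of("authenticated:" + token);
    }

    // The blocking work runs on boundedElastic, a scheduler intended for blocking tasks,
    // instead of on the Netty event loop that subscribes to the result.
    static Mono<String> validate(String token) {
        return Mono.fromCallable(() -> validateBlocking(token))
                .subscribeOn(Schedulers.boundedElastic())
                .flatMap(result -> result.map(Mono::just).orElseGet(Mono::empty));
    }

    public static void main(String[] args) {
        System.out.println(validate("some.jwt.token").block());
    }
}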
sdelamo pushed a commit that referenced this issue Mar 18, 2024 (same commit message as above)
sdelamo pushed a commit that referenced this issue Mar 18, 2024 (same commit message as above)
@sdelamo sdelamo closed this as completed Mar 18, 2024
@etremblay
Author

Thanks!
