
Use updated onPreAuth from Platform #71552

Merged
jfsiii merged 7 commits into elastic:master on Jul 14, 2020

Conversation

@jfsiii (Contributor) commented Jul 13, 2020

Summary

Track the number of open requests to certain routes and reject new requests with a 429 once we are above a configured limit.

The limit is controlled via the new config value xpack.ingestManager.fleet.maxConcurrentConnections. As a new flag it won't be immediately available in the Cloud UI. I'll double-check the timeline, but it's possible it won't be available for 7.9.

This is implemented with the updated onPreAuth platform lifecycle from #70775 and code/ideas from #70495 & jfsiii#4.

We add a global interceptor (meaning it runs on every request, not just those for our plugin) which uses a single counter to track the total open connections to any route that includes the LIMITED_CONCURRENCY_ROUTE_TAG tag. The counter increments when a request starts and decrements when it closes. The only routes with this logic are the Fleet enroll & ack routes; not checkin.
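
For reference, a rough sketch of what the interceptor looks like (simplified: the real code wraps the counter in a small MaxCounter class and reads the limit from config; the response body here is illustrative):

```ts
import { KibanaRequest, LifecycleResponseFactory, OnPreAuthToolkit } from 'kibana/server';
import { LIMITED_CONCURRENCY_ROUTE_TAG } from '../../common';

// Simplified stand-in for the MaxCounter used in the actual change.
const MAX_CONNECTIONS = 250; // driven by xpack.ingestManager.fleet.maxConcurrentConnections
let openConnections = 0;

function shouldHandleRequest(request: KibanaRequest): boolean {
  // Only routes that opted in via the tag are counted.
  return request.route.options.tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG);
}

export function preAuthHandler(
  request: KibanaRequest,
  response: LifecycleResponseFactory,
  toolkit: OnPreAuthToolkit
) {
  // Untagged routes (everything outside Fleet enroll & ack) pass through untouched.
  if (!shouldHandleRequest(request)) {
    return toolkit.next();
  }

  if (openConnections >= MAX_CONNECTIONS) {
    // At the limit: reject immediately (the status code eventually settled on 429).
    return response.customError({ statusCode: 429, body: 'Too Many Requests' });
  }

  openConnections += 1;
  // Decrement when the request closes; see the aborted$ discussion in the review below.
  request.events.aborted$.toPromise().then(() => {
    openConnections -= 1;
  });

  return toolkit.next();
}
```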

Open Questions

  1. What features must we have in place before Feature Freeze?
  2. How do we expose the flag to users? What name? Should it be under xpack.ingestManager.fleet?
  3. What's a good default value for the limit? Can/should we set it based on the characteristics of the container?

Answers

  1. Feature is exposed via a flag and only applies to the enroll & ack routes.
  2. Went with xpack.ingestManager.fleet.maxConcurrentConnections.
  3. Off by default; we can give advice on better values after more testing.

Checklist

Comparison below is against a PR from last week; will update with an up-to-date comparison.
(Screenshot: Screen Shot 2020-07-09 at 12 40 06 PM)

@jfsiii added the Team:Fleet label (Team label for Observability Data Collection Fleet team) on Jul 13, 2020
@jfsiii jfsiii requested a review from a team July 13, 2020 21:37
@elasticmachine (Contributor)

Pinging @elastic/ingest-management (Team:Ingest Management)

return tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG);
}

const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
jfsiii (Contributor Author):

How should this be configurable? Do we expose as a config flag?

Contributor:

++ to having it as a config option, it fits nicely under xpack.ingestManager.fleet

the default value of 250 seems low to me, but I'm not up to date on our load testing results

Contributor:

This indeed seems low, how did you get to that number?

jfsiii (Contributor Author):

resolving this since we've switched to 0 (off) by default

import { KibanaRequest, LifecycleResponseFactory, OnPreAuthToolkit } from 'kibana/server';
import { LIMITED_CONCURRENCY_ROUTE_TAG } from '../../common';

class MaxCounter {
jfsiii (Contributor Author):

We can handle this any number of ways. I just chose something that allowed the logic to be defined outside the route handler
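
For context, the counter is roughly this shape (illustrative; the real class may differ in details):

```ts
class MaxCounter {
  private counter = 0;

  constructor(private readonly max: number = 1) {}

  valueOf() {
    return this.counter;
  }

  increase() {
    if (this.counter < this.max) {
      this.counter += 1;
    }
  }

  decrease() {
    if (this.counter > 0) {
      this.counter -= 1;
    }
  }

  lessThanMax() {
    return this.counter < this.max;
  }
}
```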

@jfsiii jfsiii requested review from roncohen, ph and a team July 13, 2020 21:44
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS);

export function preAuthHandler(
jfsiii (Contributor Author):

Should I test this handler? If so, can someone point me in the right direction :) ?

Contributor:

for testing, maybe looking at routes/package_config/handlers.test.ts would be helpful? it has usage of httpServerMock.createKibanaRequest for creating a request object, that method has options for creating it with tags, etc.

then I think you could mock LIMITED_CONCURRENCY_MAX_REQUESTS to 1 and mock toolkit.next() to resolve after waiting for a second. Then kicking off multiple calls to the handler inside a Promise.all should return an array of responses whose status codes are [200, 503, 503, ...]
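
Something along these lines, roughly (hand-rolled mocks; the import path, handler export, and the assumption that the limit has been mocked down to 1 are all illustrative):

```ts
import { httpServerMock } from 'src/core/server/mocks';
import { LIMITED_CONCURRENCY_ROUTE_TAG } from '../../common';
import { preAuthHandler } from './limited_concurrency'; // assumed export/path

test('rejects tagged requests once the concurrency limit (mocked to 1) is hit', async () => {
  const makeRequest = () =>
    httpServerMock.createKibanaRequest({ routeTags: [LIMITED_CONCURRENCY_ROUTE_TAG] });

  // Hand-rolled mocks: next() resolves slowly so the three requests overlap.
  const toolkit = {
    next: jest.fn(async () => {
      await new Promise((resolve) => setTimeout(resolve, 1000));
      return 'next';
    }),
  } as any;
  const response = { customError: jest.fn((options) => options) } as any;

  const results = await Promise.all([
    preAuthHandler(makeRequest(), response, toolkit),
    preAuthHandler(makeRequest(), response, toolkit),
    preAuthHandler(makeRequest(), response, toolkit),
  ]);

  // First request proceeds, the rest are rejected while it is still open.
  expect(toolkit.next).toHaveBeenCalledTimes(1);
  expect(response.customError).toHaveBeenCalledTimes(2);
  expect(results).toHaveLength(3);
});
```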


const LIMITED_CONCURRENCY_MAX_REQUESTS = 250;
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS);

export function preAuthHandler(
Contributor:

would be nice to name this something Fleet or agent-related to emphasize that this handler is for those routes


counter.increase();

// requests.events.aborted$ has a bug where it's fired even when the request completes...
Contributor:

how can we decrease the counter if/when the bug is fixed? 🤔

jfsiii (Contributor Author):

There was some discussion about this starting in #70495 (comment)

It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$ which we can use instead

Contributor:

It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$ which we can use instead

This is entirely pedantic, it's not really a bug as much as an "implementation detail" per #70495 (comment)


@jfsiii (Contributor Author) commented Jul 14, 2020

@elasticmachine merge upstream

@ph (Contributor) commented Jul 14, 2020

@jfsiii Can you clarify how it works?

Any route which includes the LIMITED_CONCURRENCY_ROUTE_TAG tag will be counted. The routes are currently certain Fleet Enroll, Ack, and Checkin routes.

What is the actual behavior here? Does it:

  • Set a hard limit on the number of simultaneously connected clients? Like a hard limit X: no more than X clients can keep an open connection to the checkin endpoint.
  • Or does it actually prevent bursts and throttle them?

@nchaulet could you take a look?

@jfsiii (Contributor Author) commented Jul 14, 2020

@ph I updated the description to include more details

@roncohen (Contributor)

thanks @jfsiii!

  1. PR description says 429 but the code says 503 AFAICT. Let's change the code to produce 429s.
  2. Let's remove the "retry-after" header and implement exponential backoff in the agent for all these endpoints cc @michalpristas
  3. I understand it's a global request limit of 250 counting all the endpoints together. Since we're doing long polling, if you enroll more than LIMITED_CONCURRENCY_MAX_REQUESTS agents, all requests to any of the endpoints will fail after that. The behavior I think we want is that after an agent is checked-in and long-polling, we decrement the counter to allow more agents to check in. We're not trying to limit the number of agents that can be long-polling, just trying to protect against a surge of requests coming in at the same time.
  4. I think we might need to keep separate counters and perhaps separate limits for the endpoints eventually, but let's test more first.
  5. After applying (3), I suspect we need to set the limits lower. More testing is needed, but I'd set it to something like 50 for now.

@jfsiii (Contributor Author) commented Jul 14, 2020

thanks, @roncohen

  1. PR description says 429 but the code says 503 AFAICT. Let's change the code to produce 429s.

I updated the description. Can you say more about why 429 vs 503? I think a 503 is more accurate as it's a temporary system/server issue, not one with the request. A 429 would be if they were exceeding our API request rate, but we don't have one.

  2. Let's remove the "retry-after" header and implement exponential backoff in the agent for all these endpoints cc @michalpristas

Adding the Retry-After header doesn't initiate or enforce anything. It's there as an indication of how long the server will be busy. It's still up to the clients when they retry.

  4. I think we might need to keep separate counters and perhaps separate limits for the endpoints eventually, but let's test more first.

I wondered the same thing and figured we could come back to it.

Thanks for (3). I'll ping people in the code comments.


// requests.events.aborted$ has a bug where it's fired even when the request completes...
// we can take advantage of this bug just for load testing...
request.events.aborted$.toPromise().then(() => counter.decrease());
jfsiii (Contributor Author):

As the comment says this decrements the counter when a request completes. However, /checkin is using long-polling so I don't think the request "completes" in the same way.

As @roncohen mentions in (3) in #71552 (comment), I think the counter will increment with each /checkin but not decrement.

It seems like we could revert to using core.http.registerOnPreResponse to decrement during a checkin response, but the reason we moved away from that was that it missed aborted/failed connections (#70495 (comment)). And, as the first line says, requests.events.aborted$ has a bug where it's fired even when the request completes, so that would result in double-decrementing.

Perhaps we could do some checking in both the response and aborted handlers to see if there's a 429/503, like the initial POC https://github.com/elastic/kibana/pull/70495/files#diff-f3ee51f84520a4eb03ca041ff4c3d9a2R182

cc @kobelb @nchaulet

@roncohen (Contributor) commented Jul 14, 2020

@jfsiii I think we need to do the decrementing inside the check-in route when the actual work of checking in the agent has completed and the request transitions to long polling mode. Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit? And then we need to ensure that a given request can only be decremented once, so it doesn't get decremented again when the response finally gets sent back.

Contributor:

I agree with @roncohen here, I think the behavior is to "protect from multiple connections" until enough agents have transitioned to long polling, where their usage becomes low.

Contributor:

Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit?

I don't believe that it can at the moment... I think we have two options here

  1. Don't use the pre-auth handler to reply with a 429 for these long-polling routes. If we're doing all of the rate-limiting within the route handler itself, this becomes possible without any changes from the Kibana platform.
  2. Make changes to the Kibana platform to allow us to correlate the request seen in the pre-auth handler to the request within the route handler and the other lifecycle methods. This would allow us to only decrement the counter once by using a WeakSet. @joshdover, is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?
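
For (2), the decrement-once guard could be as small as a WeakSet keyed by the request (a sketch; it assumes the same KibanaRequest instance is visible to every lifecycle step, which is exactly what isn't guaranteed today):

```ts
import { KibanaRequest } from 'kibana/server';

// Requests we have already decremented for; a WeakSet so entries are GC'd with the request.
const alreadyDecremented = new WeakSet<KibanaRequest>();

export function decrementOnce(request: KibanaRequest, counter: { decrease(): void }) {
  if (alreadyDecremented.has(request)) {
    return; // another lifecycle step (response, aborted$, route handler) got here first
  }
  alreadyDecremented.add(request);
  counter.decrease();
}
```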

Contributor:

2. is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?

Not that is exposed currently. I am working on adding support for an id property on KibanaRequest that would be stable across lifecycle methods (#71019), but that will not make it for 7.9. I'm also not sure it would be safe to use for this use-case because the id could come from a proxy that is sending an X-Opaque-Id header which we can't guarantee will be unique? Maybe we need an internal ID that is unique that we can rely on?

We do keep a reference to the original Hapi request on the KibanaRequest object but it's been very purposefully hidden so that plugins can't abuse it.

@roncohen (Contributor)

Can you say more about why 429 vs 503? I think a 503 is more accurate as it's a temporary system/server issue, not one with the request. A 429 would be if they were exceeding our API request rate, but we don't have one.

The way I see it, 503 means the service is unavailable. In this case, it's not actually unavailable, it's just that there are currently too many requests going on to specific routes and the server is opting to reject some of those requests to make sure it can deal properly with the ones that are ongoing and allow other kinds of requests (from the user browsing the UI, for example) to proceed. So on the contrary, the service is alive and well because it's protecting itself ;)

If there's no connection to Elasticsearch, Kibana will return a 503 AFAIK. There are probably other scenarios where Kibana returns a 503. If we used 503s here, someone looking at HTTP response code graphs for Kibana will not be able to tell whether there's a significant problem like Elasticsearch being unavailable or it's just the protection from this PR being activated.

Finally, 429 doesn't have to mean that a particular user/account is making too many requests. The spec says:

Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers. Likewise, it might identify the user by its authentication credentials, or a stateful cookie.

@ph (Contributor) commented Jul 14, 2020

I didn't know that Kibana could return a 503; in that case, going with 429 might be a better solution, and this is the exact behavior of Elasticsearch when too many bulk requests are made to the server.

@jfsiii jfsiii requested a review from jen-huang July 14, 2020 15:50
@jfsiii (Contributor Author) commented Jul 14, 2020

@elasticmachine merge upstream

@jen-huang (Contributor) left a comment

Did a smoke test locally: enrolled 1 agent, collected system metrics, and deployed an updated config; everything looks good.

Giving 👍 due to time constraints, but I would like to see test coverage added for limited_concurrency.ts. At the minimum validate that we can get 429 status codes, and also validate that the counter decreases accordingly.

@jfsiii (Contributor Author) commented Jul 14, 2020

@jen-huang thanks! I added a test to confirm we don't add the preAuth handler unless the config value is set (> 0).

We'll see what CI holds, but this is what I got locally

 PASS  plugins/ingest_manager/server/routes/limited_concurrency.test.ts
  registerLimitedConcurrencyRoutes
    ✓ doesn't call registerOnPreAuth if maxConcurrentConnections is 0 (10ms)
    ✓ calls registerOnPreAuth once if maxConcurrentConnections is 1 (5ms)
    ✓ calls registerOnPreAuth once if maxConcurrentConnections is 1000 (3ms)

I'll try to add some more (like "do we only process the correct routes?") before merging, but I'll create an issue for follow-up tests and link it when I merge this.
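
Roughly what the config-gating assertion looks like (a sketch; the registration function's name comes from the test output above, but its exact signature is an assumption):

```ts
import { coreMock } from 'src/core/server/mocks';
import { registerLimitedConcurrencyRoutes } from './limited_concurrency'; // assumed path

test("doesn't call registerOnPreAuth when maxConcurrentConnections is 0", async () => {
  const core = coreMock.createSetup();
  // Config shape abbreviated to the field under test.
  const config = { fleet: { maxConcurrentConnections: 0 } } as any;

  await registerLimitedConcurrencyRoutes(core, config);

  expect(core.http.registerOnPreAuth).not.toHaveBeenCalled();
});
```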

ph added a commit to ph/beats that referenced this pull request Jul 14, 2020
When enrolling, if the server is currently handling too many concurrent requests it will return a 429 status code. The enroll subcommand will retry enrolling with an exponential backoff (initial 15 sec, max 10 min).

This also adjusts the backoff logic in the ACK.

Requires: elastic/kibana#71552
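
The agent change itself is in Go; purely to illustrate the schedule the commit message describes (start at 15 s, cap at 10 min; the doubling factor is an assumption):

```ts
// Illustration only — the Elastic Agent implements this in Go.
const INITIAL_BACKOFF_MS = 15 * 1000; // 15 seconds
const MAX_BACKOFF_MS = 10 * 60 * 1000; // 10 minutes

// Delay before retry attempt `attempt` (0-based) after receiving a 429.
function backoffDelay(attempt: number): number {
  return Math.min(INITIAL_BACKOFF_MS * 2 ** attempt, MAX_BACKOFF_MS);
}
```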
@kibanamachine (Contributor)

💚 Build Succeeded

Build metrics

✅ unchanged


@jfsiii jfsiii merged commit 04cdb5a into elastic:master Jul 14, 2020
ph added a commit to elastic/beats that referenced this pull request Jul 14, 2020
[Elastic Agent] Handle 429 response from the server and adjust backoff (#19918)
ph added a commit to ph/beats that referenced this pull request Jul 14, 2020
[Elastic Agent] Handle 429 response from the server and adjust backoff (elastic#19918)
(cherry picked from commit 2db2152)
jfsiii pushed a commit to jfsiii/kibana that referenced this pull request Jul 14, 2020
* Use updated onPreAuth from Platform

* Add config flag. Increase default value.

* Set max connections flag default to 0 (disabled)

* Don't use limiting logic on checkin route

* Confirm preAuth handler only added when max > 0

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
ph added a commit to elastic/beats that referenced this pull request Jul 14, 2020
[Elastic Agent] Handle 429 response from the server and adjust backoff (#19918) (#19926)
(cherry picked from commit 2db2152)
jfsiii pushed a commit that referenced this pull request Jul 15, 2020
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
[Elastic Agent] Handle 429 response from the server and adjust backoff (elastic#19918)
Labels
release_note:skip (Skip the PR/issue when compiling release notes), Team:Fleet (Team label for Observability Data Collection Fleet team), v7.9.0, v8.0.0

9 participants