
[Ingest Manager] Protect Kibana from spikes caused by Elastic Agent enrollment or check-in. #67987

Closed
ph opened this issue Jun 2, 2020 · 26 comments
Assignees
Labels
Team:Fleet (Observability Data Collection Fleet team)

Comments

@ph
Contributor

ph commented Jun 2, 2020

When you first enroll a large fleet of Elastic Agents, or if your workforce turns on their personal computers within a short timeframe, we risk a stampede of agents talking to Kibana and negatively impacting Kibana's performance.

We have to investigate ways to mitigate the problem. For example, we might be able to add protection at the server level to reject excess simultaneous incoming connections from Elastic Agent.

@ph added the Team:Fleet (Observability Data Collection Fleet team) and Ingest Management:beta1 labels on Jun 2, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@ph assigned jfsiii on Jun 4, 2020
@kobelb
Contributor

kobelb commented Jun 4, 2020

We've traditionally recommended that users deploy Kibana in a high-availability deployment and perform this type of logic at the load balancer. However, I totally get why this is inconvenient, specifically for Fleet.

Do Agents access Fleet specific APIs in Kibana, or are they also consuming "general purpose APIs", which have other consumers?

@ph
Contributor Author

ph commented Jun 5, 2020

@kobelb Agents target Agent-specific API endpoints that are only available in our plugin. I believe this will also be the case for Endpoint if it calls Kibana directly. @kevinlog

@kobelb
Contributor

kobelb commented Jun 16, 2020

I think we should add a setting, specified in kibana.yml, which limits how many Agents can be concurrently connected to a single instance of Kibana. When this threshold is crossed, Kibana should begin responding with a 429 - Too Many Requests.

While I wish that there was some intelligent way for Kibana to determine when it's being overloaded by requests from Agents, and begin replying with a 429, I don't think this is feasible. The number of Agents which an instance of Kibana can support is highly dependent on the size of the Elasticsearch cluster it's connected to, the load on the Elasticsearch cluster, the resources (cpu, memory, file descriptors) available to Kibana itself, and the other requests and background tasks which Kibana is currently running. In short, "it depends...".

I'm not aware of anywhere else in Kibana where we wish to limit the number of concurrent requests to a group of APIs, so it doesn't feel like something we should generalize to the Kibana Platform. @joshdover, have you heard of any similar requests, or do you envision us benefiting from a generalized solution to this problem? A generalized solution potentially becomes even harder if the Agents use long-polling: we'd need a limit on the number of Agents with established long-polling connections, but we'd also potentially need to queue or limit the requests internal to those long-polling connections to prevent them from overwhelming Kibana.
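A minimal sketch of that kind of gate, assuming a hypothetical kibana.yml-driven limit (the name `maxConcurrentAgentRequests` and the shape below are illustrative, not an existing Kibana API):

```ts
// Hypothetical sketch: a configurable ceiling on concurrently handled Agent
// requests. The caller replies 429 whenever enter() returns false and must
// call done() exactly once when the request finishes (success or error).
export function createAgentRequestGate(maxConcurrentAgentRequests: number) {
  let inFlight = 0;

  return {
    enter(): boolean {
      if (inFlight >= maxConcurrentAgentRequests) return false; // -> reply 429
      inFlight++;
      return true;
    },
    done(): void {
      inFlight = Math.max(0, inFlight - 1);
    },
  };
}
```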

@jfsiii
Contributor

jfsiii commented Jun 17, 2020

@kobelb I realize Fleet uses Kibana in a different way than is typical, but I think the stress testing is highlighting issues that affect all/many plugins. The gains @rudolf got from #68221 make things faster for everyone.

One option that I've mentioned while talking with @joshdover and @rudolf is using Hapi's server.options.load feature which will 503 based on certain criteria including max number of concurrent connections and event loop delay.
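For reference, a minimal sketch of enabling that load protection on a Hapi server (the numbers are illustrative, and Kibana constructs its Hapi server internally, so this isn't a drop-in change):

```ts
import * as Hapi from '@hapi/hapi';

// Hapi samples process load every `sampleInterval` ms and replies
// 503 Service Unavailable when a configured limit is exceeded.
const server = Hapi.server({
  port: 5601,
  load: {
    sampleInterval: 1000,            // ms between load samples (0 disables sampling)
    maxEventLoopDelay: 500,          // 503 once the event loop lags more than 500ms
    maxHeapUsedBytes: 1_073_741_824, // optional heap ceiling (~1 GiB)
  },
});
```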

I played with some event loop values and it definitely helps when waves of requests come in. The Stack Monitoring view (or maybe something elsewhere) handled the 503 without any code changes.
[Screenshot from 2020-05-28: Stack Monitoring while a wave of requests comes in]

I think the Platform applying (or providing) some guardrails / circuit breakers makes sense and would be welcome. A default (max?) event loop delay of 500ms or 1s seems reasonable. The kibana.yml setting you mentioned seems to align as well.

I can open or comment more on a separate ticket if you like.

@kobelb
Contributor

kobelb commented Jun 17, 2020

One option that I've mentioned while talking with @joshdover and @rudolf is using Hapi's server.options.load feature which will 503 based on certain criteria including max number of concurrent connections and event loop delay.

Exposing some of these settings via equivalent kibana.yml settings seems reasonable. However, we'll want to be careful... Hapi version 19 removed server.options.load.concurrent because it was infrequently used. We treat Kibana's use of Hapi as an "implementation detail", so whatever settings we expose to the user we should be supporting for the entire major version, regardless of what Hapi does.

Finding reasonable defaults for these global load settings is potentially challenging as well. For example, the event-loop delay at which the HTTP server should stop accepting new HTTP requests really depends on how Kibana is being used. If the event-loop delay is being caused by background tasks in Kibana, we risk not accepting any new HTTP requests at all. This is potentially problematic because it prevents users from using Kibana itself to determine what is going awry. The user may well be willing to accept a one-second delay imposed by the event loop as long as they can continue using Kibana.

A similar situation applies to the number of concurrent requests, which we potentially don't want to expose because Hapi dropped support for it. How many concurrent requests are acceptable really depends on the resources available to Kibana.

I also don't think we'd want to solely rely on settings that apply to all of Kibana's HTTP traffic in this situation. Kibana commonly serves HTTP requests for consumers besides Fleet. In this configuration, we'll want to prevent Fleet from starving Kibana of the resources that it requires to continue serving end-user initiated HTTP requests.

@jfsiii
Contributor

jfsiii commented Jun 17, 2020

We treat Kibana's use of Hapi as an "implementation detail", so whatever settings we expose to the user we should be supporting for the entire major version, regardless of what Hapi does.

I totally understand. I didn't mention it in this comment, but I've said as much when talking to Josh & Rudolf.

I also don't think we'd want to solely rely on settings that apply to all of Kibana's HTTP traffic in this situation. Kibana commonly serves HTTP requests for consumers besides Fleet. In this configuration, we'll want to prevent Fleet from starving Kibana of the resources that it requires to continue serving end-user initiated HTTP requests.

This is Node, so we're all in the same event loop together. I'm hoping to avoid new settings (event loop delay, concurrent connections) and new behavior for existing flows (returning 429 or 503), and to concentrate on identifying bottlenecks that affect everyone so we make things fast(er) by default.

@joshdover
Contributor

joshdover commented Jun 17, 2020

@joshdover have you heard of any similar requests or envision us benefitting from a generalized solution to this problem?

I'm not aware of any other plugins with similar requirements.

I also don't think we'd want to solely rely on settings that apply to all of Kibana's HTTP traffic in this situation. Kibana commonly serves HTTP requests for consumers besides Fleet. In this configuration, we'll want to prevent Fleet from starving Kibana of the resources that it requires to continue serving end-user initiated HTTP requests.

I think we could experiment with implementing a wrapper for the IRouter interface that would implement these safeguards for a subset of routes (vs. the entire application). Of course any event loop delay-based check will be affected by other components, but we could at least isolate the affected routes to just Fleet for the time being. A concurrent connection-based check should work well in a wrapper.

This would allow us to experiment with these safeguards and find good defaults without affecting the rest of Kibana. One reason not to apply these safeguards to all of Kibana is that some APIs should not be affected by them at all. A good example is the /api/status endpoint, which serves as a health check for load balancers. By implementing these as a wrapper, we can isolate them to specific endpoints and then re-evaluate how to roll them out to the rest of Kibana later.
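A rough sketch of that wrapper idea, using deliberately simplified router/handler shapes rather than Kibana's real IRouter types:

```ts
// Simplified stand-ins for Kibana's router and response types; the real
// IRouter signatures differ, so treat this purely as a shape sketch.
type Handler = (req: unknown) => Promise<{ status: number; body?: unknown }>;

interface SimpleRouter {
  get(path: string, handler: Handler): void;
  post(path: string, handler: Handler): void;
}

// Wrap a router so at most `maxConcurrent` requests to the wrapped routes are
// handled at once; anything beyond that gets an immediate 429.
export function withConcurrencyGuard(inner: SimpleRouter, maxConcurrent: number): SimpleRouter {
  let active = 0;

  const guard = (handler: Handler): Handler => async (req) => {
    if (active >= maxConcurrent) {
      return { status: 429, body: { message: 'Too Many Requests' } };
    }
    active++;
    try {
      return await handler(req);
    } finally {
      active--; // released on success, error, or rejection
    }
  };

  return {
    get: (path, handler) => inner.get(path, guard(handler)),
    post: (path, handler) => inner.post(path, guard(handler)),
  };
}
```

Because the counter lives entirely in the wrapper, only the routes registered through it are affected.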

@kobelb
Contributor

kobelb commented Jun 17, 2020

Having thought about it more, I wouldn't recommend using the event-loop delay as a signal for when Kibana should start rejecting HTTP requests, even just for Agents. Any number of factors could lead to the event loop being delayed, and there's no certainty that rejecting HTTP requests will help remedy the situation. If Kibana doesn't fall over and is eventually able to reply to the HTTP request, even when under heavy load, that creates a better experience for all clients.

Are you all seeing the Kibana process crash when there are too many Elastic Agents enrolling or checking in?

@roncohen
Contributor

@kobelb I don't think we've seen an outright crash, but Kibana becomes overloaded to the point where it's no longer possible to use it.

@jfsiii to start with something simple, could we "just" count how many enrollments are currently ongoing and return a 429 if the number of concurrent enrollment requests goes above X? X could be a Fleet configuration setting in kibana.yml.

We could do the same for the "check in" part of the check-in route. After we've handled the check-in request and the Agent is in the long-polling state, we should decrement the counter to allow more agents to check in. And then we also need to teach Elastic Agent to add a reasonable amount of jitter to its retries. It's not ideal, but maybe it's a good enough start?

cc @nchaulet
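
A rough sketch of that counting approach (the class, the limit of 250, and the handler shape are illustrative, not taken from any actual implementation):

```ts
// Count in-flight check-in work and reply 429 above a configurable limit.
// The slot is released once the check-in has been processed and the Agent
// transitions to the long-polling wait, so long polls don't hold a slot.
class RequestCounter {
  private current = 0;
  constructor(private readonly max: number) {}

  tryAcquire(): boolean {
    if (this.current >= this.max) return false;
    this.current++;
    return true;
  }

  release(): void {
    this.current = Math.max(0, this.current - 1);
  }
}

// `250` stands in for a Fleet setting that would live in kibana.yml.
const checkins = new RequestCounter(250);

export async function handleCheckin(
  processCheckin: () => Promise<void>,      // auth, saved-object updates, etc.
  waitForNewActions: () => Promise<unknown> // the long-polling wait
) {
  if (!checkins.tryAcquire()) {
    // Over the limit: the Agent should back off and retry with jitter.
    return { status: 429, body: { message: 'Too Many Requests' } };
  }
  try {
    await processCheckin();
  } finally {
    checkins.release(); // free the slot before entering the long poll
  }
  const actions = await waitForNewActions();
  return { status: 200, body: { actions } };
}
```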

@kobelb
Contributor

kobelb commented Jun 30, 2020

@kobelb I don't think we've seen an outright crash, but Kibana becomes overloaded to the point where it's no longer possible to use it.

Despite Kibana becoming overloaded to the point where it's no longer possible to use it, I think this is a lot better than Kibana crashing. If Kibana is deployed in a high-availability deployment behind a load-balancer which properly does health-checks, the load balancer should be able to detect that Kibana has become overloaded and either redirect or reject new HTTP requests if there is no more capacity.

@roncohen
Contributor

roncohen commented Jul 1, 2020

If Kibana is deployed in a high-availability deployment behind a load-balancer which properly does health-checks, the load balancer should be able to detect that Kibana has become overloaded and either redirect or reject new HTTP requests if there is no more capacity.

Yes and that would be a bad experience :)
The load balancer's job is to distribute the load more or less equally between the Kibana instances. If one of your Kibanas is overloaded due to a horde of Agents trying to enroll or check in, all the Kibanas behind the load balancer are likely in the same situation. If you're running Kibana behind a load balancer that does health checks, the load balancer will remove Kibanas one by one, each time making the load on the remaining Kibanas worse.

Meanwhile, users cannot use Kibana and the whole system will feel "frozen up". You only get control back if you kill your agents or dramatically scale up the number of Kibanas. If you've just rolled out Agent on 20k desktop machines, rolling those back is not necessarily a simple task.

@kobelb
Contributor

kobelb commented Jul 1, 2020

Yes and that would be a bad experience :)
The load balancer's job is to distribute the load more or less equally between the Kibana instances. If one of your Kibanas is overloaded due to a horde of Agents trying to enroll or check in, all the Kibanas behind the load balancer are likely in the same situation. If you're running Kibana behind a load balancer that does health checks, the load balancer will remove Kibanas one by one, each time making the load on the remaining Kibanas worse.

Meanwhile, users cannot use Kibana and the whole system will feel "frozen up". You only get control back if you kill your agents or dramatically scale up the number of Kibanas. If you've just rolled out Agent on 20k desktop machines, rolling those back is not necessarily a simple task.

I agree. There are a few ways in which we can prevent this from happening, and I stand by the mitigation and reasoning I previously expressed in #67987 (comment). Perhaps we're in agreement here, and I'm just misunderstanding.

@roncohen
Contributor

roncohen commented Jul 1, 2020

@kobelb OK. We'd like to do something for 7.9. In theory we could do something purely on the Fleet side, but there we have no control over when the API keys are checked. Ideally, we'd reject traffic before Kibana starts checking the API key, so the rejection is as cheap as possible. Could you work with us to allow us to hook in at the right place?

@ph
Contributor Author

ph commented Jul 1, 2020

Ideally, we'd reject traffic before Kibana starts checking the API key, so the rejection is as cheap as possible. Could you work with us to allow us to hook in at the right place?

That would be before the Fleet/Kibana code? It needs to happen really early in the request execution.

@kobelb
Contributor

kobelb commented Jul 1, 2020

I put up a quick proof-of-concept here. There are some open questions that I've asked various Kibana Platform team members for feedback on.

@ph
Contributor Author

ph commented Jul 8, 2020

@jfsiii Did you run some tests with the above proof-of-concept?

@jfsiii
Contributor

jfsiii commented Jul 8, 2020

@ph I did and am looking into some issues I found.

My wave-of-connections test eventually got to a place where we were always over the limit. I think it has to do with catching/decrementing on errors, but I didn't get far into it. The example is clear/simple, so the fix shouldn't be bad.

There was also an increase in event loop lag, but I wanted to get it running correctly before I put too much thought into what I see in Monitoring.
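
For what it's worth, the usual way to keep a counter like that from drifting upward is to decrement in a finally block so that every exit path, including thrown errors, releases its slot (a generic sketch, not the actual PoC code):

```ts
// Decrement on every exit path: normal return, thrown error, or rejected promise.
export async function counted<T>(counter: { value: number }, handler: () => Promise<T>): Promise<T> {
  counter.value++;
  try {
    return await handler();
  } finally {
    counter.value--;
  }
}
```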

@ph
Contributor Author

ph commented Jul 8, 2020

@jfsiii is it optimistic to say that we can have something simple for 7.9?

@jfsiii
Contributor

jfsiii commented Jul 8, 2020

@ph I think we can have something running as kobelb intended very shortly (in time for 7.9). However, one key aspect of that PR is the preAuth change in core, or the preRouting PR based on it, so that's up to Platform.

I'll do a version which uses the existing platform code, but ideally we'd have the onPreAuth change as well.
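
To illustrate where such a hook would sit, here's a rough sketch of rejecting Agent traffic in a pre-auth step, before any API key lookup. The route prefix, the limit, and the interceptor shapes are simplified stand-ins, not Kibana's actual lifecycle signatures:

```ts
// Simplified stand-ins for Kibana's pre-auth interceptor types.
interface PreAuthRequest { url: { pathname: string } }
interface PreAuthResponseFactory { tooManyRequests(): unknown }
interface PreAuthToolkit { next(): unknown }

const MAX_AGENT_CONNECTIONS = 250; // hypothetical kibana.yml-driven limit
let agentConnections = 0;

export function agentLimitPreAuth(
  req: PreAuthRequest,
  res: PreAuthResponseFactory,
  t: PreAuthToolkit
) {
  // Only guard Agent-facing routes (prefix is illustrative); everything else
  // passes straight through untouched.
  if (!req.url.pathname.startsWith('/api/ingest_manager/fleet/')) {
    return t.next();
  }
  if (agentConnections >= MAX_AGENT_CONNECTIONS) {
    // Rejecting here is cheap: no API key has been checked yet.
    return res.tooManyRequests();
  }
  agentConnections++;
  return t.next();
  // Decrementing when the response finishes (or the socket closes) needs a
  // separate hook, not shown here.
}
```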

@jfsiii
Contributor

jfsiii commented Jul 9, 2020

@ph I put a draft PR up at #71221 but the connections count doesn't return to 0. I've pinged some people for additional 👀

@jfsiii
Contributor

jfsiii commented Jul 13, 2020

@ph #71221 uses the new/updated lifecycle methods from Platform, but a) there are some small improvements still pending from Platform, and b) I need to push the code I cleaned up from the POC.

To ensure we have some mechanism in place for 7.9, I'll open a new PR today which does the same count/rejection using existing lifecycle methods. That way our team can approve & merge it before FF.

@ph
Contributor Author

ph commented Jul 13, 2020

Thanks @jfsiii

@jfsiii
Contributor

jfsiii commented Jul 13, 2020

@ph Platform merged the updated lifecycle method, so I put up a PR based on it: #71552

@jfsiii
Contributor

jfsiii commented Jul 14, 2020

I closed #71221 in favor of #71552. I'll merge that PR as soon as CI is green, which will be before the 7.9 FF.

@roncohen
Contributor

This is done. Let's open separate issues for any follow-ups.
