
[RFC] Node.js clustering in Kibana #94057

Merged 19 commits on Jun 28, 2021

Conversation

@pgayvallet (Contributor) commented Mar 9, 2021

Summary

This RFC(-ish) describes the proposed implementation and the required code changes to have Node.js clustering available in Kibana.

Original issue: #68626
POC PR: #93380

Rendered RFC: https://github.com/pgayvallet/kibana/blob/kbn-rfc-clustering/rfcs/text/0020_nodejs_clustering.md

@pgayvallet added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v8.0.0, release_note:skip (Skip the PR/issue when compiling release notes), and RFC labels Mar 9, 2021
@pgayvallet marked this pull request as ready for review March 9, 2021 09:53
@elasticmachine (Contributor)

Pinging @elastic/kibana-core (Team:Core)


![image](../images/15_clustering/perf-4-workers.png)

### Analysis
Contributor:

Maybe you could explain a bit more how our benchmark tests work? My understanding is that they perform a single flow and measure the response times for completing that flow. This is a good smoke test to see what performance a single user would observe on a mostly idle system. However, does our benchmarking run enough parallel requests to start maxing out a single core, let alone multiple cores?

Unless our benchmark is causing the node process to consume so much CPU that we start seeing delays in the event-loop (and some requests start hitting a timeout) we're unlikely to see the true performance gain clustering could achieve.

Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure the requests-per-second throughput.

@pgayvallet (Contributor, Author) Mar 9, 2021:

> Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure the requests-per-second throughput.

This is not how Gatling works, from what I understand. The best we can do would be to increase the number of requests performed per batch and observe the difference in response time.

cc @dmlemeshko wdyt?

Member:

From the posted screenshot I think @pgayvallet used the DemoJourney simulation, which runs a sequence of API calls for each virtual user:

  • keeping 20 concurrent users for the first 3 minutes
  • increasing from 20 to 50 concurrent users within the next 3 minutes

Gatling spins up virtual users based on the simulation and tracks the request time for each API call. It has no APM-like functionality. The full HTML report has other charts that could give more information, but not Kibana req/sec throughput.

Running 200 concurrent users should give a clearer diff in the reports.

Contributor:

I don't know anything about our setup, but it seems like Gatling should be able to give us requests succeeded vs requests failed (OK vs KO): https://gatling.io/docs/current/general/reports/

If we can get this metric, we should push concurrent users up until we start seeing at least some failures (maybe 1-5%).

We would then have to double the concurrent users when trying with two workers and compare the number of successful requests against the one-worker scenario.
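
As a rough illustration of that kind of throughput measurement (outside of Gatling), a minimal Node.js probe could look like the sketch below; the endpoint, header, and the use of the autocannon package are assumptions, not part of the benchmark setup described in this thread.

```ts
// Hypothetical throughput probe, not the actual benchmark setup: push
// `connections` up until errors/non-2xx reach ~1-5%, then compare requests/sec
// between 1-worker and multi-worker runs.
import autocannon from 'autocannon';

async function probe(connections: number) {
  const result = await autocannon({
    url: 'http://localhost:5601/api/status', // assumed local Kibana endpoint
    connections, // roughly "concurrent users"
    duration: 60, // seconds
    headers: { 'kbn-xsrf': 'true' },
  });
  console.log({
    connections,
    rps: result.requests.average, // requests per second
    non2xx: result.non2xx,
    errors: result.errors,
  });
}

probe(200).catch(console.error);
```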

Contributor:

FWIW, when doing performance testing for Fleet, I had to do what @rudolf suggested and manually increase the concurrent users to find the breaking point.

any sense to have the coordinator recreate them indefinitely, as the error requires manual intervention.

Should we try to distinguish recoverable and non-recoverable errors, or are we good terminating the main Kibana process
when any worker terminates unexpectedly for any reason (After all, this is already the behavior in non-cluster mode)?
Contributor:

The advantage of having the coordinator restart the worker is that it's much faster to recover from an unhandled exception than restarting all of Kibana (and doing config validation, migration checks, etc.).

However, it feels like this could be a second phase, and we can start by simply killing all workers if one throws an unhandled exception, since this should be rare.

Member:

In my experience, there are different types of events, and they need to be handled from both ends:

  • Sometimes, the IPC channel is closed due to the host being overloaded. These usually auto-heal, but if the coordinator exits during that disconnection, the worker won't be stopped and we'll get a zombie process. If I'm recalling correctly, that's a disconnect event that needs to be handled by the worker so it can kill itself.
  • On exit, there's the exit code and the exitedAfterDisconnect flag, which could help with identifying whether it's a broken process or an intentional shutdown.

But I think it's worth considering what @rudolf says, probably in conjunction with graceful shutdowns: i.e. when one worker fails, we send the kill signal so all the other workers stop gracefully before being killed.
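
For context, the kind of handling being discussed maps onto Node's cluster events roughly like this; a sketch only (using Node 16+ naming such as isPrimary/exitedAfterDisconnect), not the RFC's final design.

```ts
import cluster from 'cluster';

if (cluster.isPrimary) {
  cluster.fork();

  cluster.on('exit', (worker, code, signal) => {
    // exitedAfterDisconnect distinguishes intentional shutdowns (kill()/disconnect())
    // from crashes; code/signal describe how the process ended.
    if (worker.exitedAfterDisconnect) {
      return; // intentional, nothing to do
    }
    // Proposal discussed here: treat any unexpected worker exit as fatal and stop
    // the whole Kibana process, mirroring today's non-cluster behavior.
    console.error(`worker ${worker.process.pid} died (code=${code}, signal=${signal})`);
    for (const other of Object.values(cluster.workers ?? {})) {
      other?.kill(); // let the remaining workers shut down before exiting
    }
    process.exit(1);
  });
} else {
  // Worker side: if the IPC channel to the coordinator is lost and never comes
  // back, exit instead of lingering as a zombie process.
  process.on('disconnect', () => process.exit(1));
}
```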

Member:

The current proposal is to kill all processes as Rudolf describes, with worker restarts happening in a follow-up phase, after we make a plan for identifying recoverable vs non-recoverable errors.

@tylersmalley changed the title [RFC] NodeJS clustering in Kibana → [RFC] Node.js clustering in Kibana Mar 9, 2021

Notes:
- What should be the default value for `clustering.workers`? We could go with `Max(1, os.cpus().length - 1)`, but do we really want to use all cpus by default,
knowing that every worker is going to have its own memory usage.
Member:

Based on the benchmarks, do nodes use less memory when they split the load?

Contributor:

++ to this question. Also curious what Elasticsearch does here; it's possible we can follow a similar pattern. If we can minimize memory overhead (see comment above), it'd be great to have a sensible automatic scaling default.

Contributor:

The process's RSS memory will grow with the number of open requests. So, theoretically, for a given number of requests per second, more workers will each require less memory. However, the garbage collector might not collect as aggressively if the RSS is below the maximum heap. And even if there's a trend in our benchmarks, it doesn't mean there won't be spikes: if a given worker handles a request to export a large enough number of saved objects, that worker will consume all of its heap, and adding more workers won't improve this.

So in practice I think we should ignore any memory benchmarks and set the expectation with users that each worker will use up to --max-old-space-size (or the default for their system).
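
To illustrate that expectation, here is a sketch under assumed values (not an actual Kibana change; Node 16+ naming, older versions use isMaster/setupMaster): each forked worker is its own Node.js process with its own V8 heap, so any heap cap applies per worker and the theoretical ceiling scales with the worker count.

```ts
import cluster from 'cluster';

if (cluster.isPrimary) {
  // Hypothetical: cap each worker's old space at 1 GB; with N workers the
  // theoretical heap ceiling is roughly N * 1 GB (plus per-process RSS overhead).
  const maxOldSpaceMb = 1024;
  cluster.setupPrimary({ execArgv: [`--max-old-space-size=${maxOldSpaceMb}`] });
  for (let i = 0; i < 2; i++) {
    cluster.fork();
  }
} else {
  // Each worker reports its own memory usage independently of its siblings.
  const { heapUsed, rss } = process.memoryUsage();
  console.log(`worker ${process.pid}: heapUsed=${heapUsed} rss=${rss}`);
}
```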

Contributor:

> do we really want to use all cpus by default

It feels like there really isn't a good way to automatically choose the correct value here. Perhaps we should just always require that the user sets this in configuration if they want clustering enabled?
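
For reference, the default under debate and the alternative of requiring an explicit value could be sketched as follows; the config shape is illustrative only, not the actual Kibana schema.

```ts
import os from 'os';

// The default quoted from the RFC text, expressed in code:
const defaultWorkerCount = Math.max(1, os.cpus().length - 1);

// Hypothetical config shape, not the real Kibana config schema.
interface ClusteringConfig {
  enabled: boolean;
  workers?: number;
}

function resolveWorkerCount(config: ClusteringConfig): number {
  if (!config.enabled) return 1;
  // Option A (RFC draft): fall back to "all cores minus one".
  // Option B (this thread): throw instead, forcing users to set it explicitly.
  return config.workers ?? defaultWorkerCount;
}
```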

- Implementation cost is going to be significant, both in core and in plugins. Also, this will have to be a collaborative
effort, as we can't enable the clustered mode in production until all the identified breaking changes have been addressed.

- Even if easier to deploy, it doesn't really provide anything more than a multi-instances Kibana setup.
Contributor:

Do we have any side-by-side comparisons of the performance of the two modes?

Contributor:

> it doesn't really provide anything more than a multi-instances Kibana setup.

Kibana does a lot of background polling work, which creates a significant amount of network traffic. If we centralize all this background polling to a single worker, we can reduce this background work by a factor equal to the number of worker processes (maybe that would be a factor of 8 or 16 for most deployments?).

It's probably worth trying to quantify the impact, e.g. reducing 1 KB of traffic per month by a factor of 16 is negligible, but reducing 10 GB by a factor of 16 would be a big win.

Contributor (author):

> If we centralize all this background polling to a single worker we can reduce this background work by a factor equal to the number of worker processes

That sounds great, but it would also require more changes from plugins.

E.g. we could have only one worker perform the license check calls and then broadcast the result to the other workers, but that means changing the licensing plugin implementation.

All these optimizations can easily be done later as follow-ups, though.

> It's probably worth trying to quantify the impact

We would need to list all the 'background' requests we are performing against ES. Not sure how to do that without help from the other teams, though.
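
To make the scale of such a change concrete, a "poll once, broadcast to all workers" pattern could look roughly like the sketch below; the message shape and the fetch helper are hypothetical, not an existing Kibana or licensing-plugin API.

```ts
import cluster from 'cluster';

interface LicenseMessage {
  type: 'license:updated';
  payload: unknown;
}

async function fetchLicenseFromEs(): Promise<unknown> {
  // Placeholder: in reality this would call the Elasticsearch license API.
  return { type: 'basic' };
}

if (cluster.isPrimary) {
  const workers = [cluster.fork(), cluster.fork()];

  // Only one process polls Elasticsearch...
  setInterval(async () => {
    const msg: LicenseMessage = { type: 'license:updated', payload: await fetchLicenseFromEs() };
    // ...and the result is broadcast to every worker over the IPC channel.
    workers.forEach((worker) => worker.send(msg));
  }, 30_000);
} else {
  // Workers consume the broadcast instead of each polling ES themselves.
  process.on('message', (message) => {
    const msg = message as LicenseMessage;
    if (msg.type === 'license:updated') {
      // update the local license cache / observable here
    }
  });
}
```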

Contributor:

Agreed, but I think it's worth adding this as a benefit of clustering.

Member:

This is listed as a potential future benefit of clustering, and the current RFC doesn't preclude us from doing this. However, the current proposal is to treat it as an enhancement and not address it in the initial implementation.


- Between non-clustered and 2-worker cluster mode, we observe a 20-25% gain in the 50th percentile response time.
Gains for the 75th and 95th percentiles are between 10% and 40%.
- Between 2-worker and 4-worker cluster mode, the gain on the 50th is negligible, but the 75th and the 95th are
Contributor:

It might be interesting to compare with a setup containing several Kibana instances behind a load balancer.

@LeeDr (Contributor) commented Mar 17, 2021

@dmlemeshko please correct me if I misspeak here, as I'm just referring to your work.

One testing issue we're still working on is getting Elasticsearch and Kibana monitoring data while running the Gatling tests. Some early results on Cloud seemed to indicate that Elasticsearch was the bottleneck once we hit a certain number of users with Dima's demo scenario. Once the search queue builds up and we start getting searches rejected, there's nothing Kibana can do to support more users.
Also note that in those tests there was NO INGESTION of data into Elasticsearch. It was purely queries against a small sample data set. Adding ingestion would have reduced the throughput of Kibana even more.

I'm sure there are ways to scale Elasticsearch up to handle more requests/sec from Kibana, but it wasn't as simple as doubling the number of Elasticsearch nodes.

Running a similar Gatling test on a Jenkins machine with Elasticsearch and Kibana on the same host had significantly higher throughput (lower latency, maybe a higher-performance machine). But I'm still not sure whether Elasticsearch or Kibana is the limiting factor in this case. I think we really need a load test case where we know Kibana is the bottleneck by a good margin, and then compare Kibana with/without clustering.

The Read Threads chart below shows the Search Rejections hit about 245 (at 10-second intervals) during a Gatling load test:
[screenshot: Read Threads chart showing search rejections]

Here's a visualization of the Gatling data during the same run. In this case I'm not seeing the response times get notably worse corresponding to the search rejections, but I think that's because Dima backed the number of users down. I think when he added more users you could see an obvious impact.
[screenshot: Gatling response-time visualization]

@joshdover mentioned this pull request Mar 23, 2021

One pragmatic solution could be, when clustering is enabled, to create a sub folder under path.data for each worker.

The data folder is not considered part of our public API, and the implementation and path already changed in previous
Contributor:

Really? How did it change in previous minor releases?

Contributor (author):

This statement is coming from @joshdover; I'll let him answer if he remembers.

Contributor:

I'm not sure what I said, and I'm having a hard time finding an instance where we changed this. I must have misspoken!

However, I don't believe we consider the structure of the data folder to be part of our public API. We should be able to use the sub-folder approach if needed.
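
A minimal sketch of that sub-folder approach (hypothetical naming, not the RFC's final design) could look like this:

```ts
import { mkdirSync } from 'fs';
import { join } from 'path';
import cluster from 'cluster';

// Resolve the effective data path for the current process. In non-cluster mode
// (or in the coordinator) this is just path.data; in a worker it becomes a
// per-worker sub-folder such as data/worker-1.
function resolveDataPath(basePath: string): string {
  if (!cluster.isWorker) {
    return basePath;
  }
  const workerPath = join(basePath, `worker-${cluster.worker!.id}`);
  mkdirSync(workerPath, { recursive: true });
  return workerPath;
}

// Example: resolveDataPath('/var/lib/kibana') -> '/var/lib/kibana/worker-1' in worker 1.
```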

Contributor:

What all is stored in the data folder? Is it just the uuid file? The docs are not really clear...

> path.data: The path where Kibana stores persistent data not saved in Elasticsearch. Default: data

Source: https://www.elastic.co/guide/en/kibana/current/settings.html

And then they're contradicted by the package directory layout:

  • data: The location of the data files written to disk by Kibana and its plugins (default /var/lib/kibana, setting path.data)
  • logs: Log files location (setting path.logs)
  • plugins: Plugin files location. Each plugin will be contained in a subdirectory.

Source: https://www.elastic.co/guide/en/kibana/current/rpm.html#rpm-layout

Contributor:

The only plugin that I have found referencing this is the reporting plugin; however, I can't find a place in the code that actually reads the configuration. @elastic/kibana-reporting-services Could you provide any guidance on how the path.data config is used by the Reporting plugin? My guess is some sort of temporary storage, but I can't find any usages.

Member:

Pinging @elastic/kibana-app-services for any guidance on what reporting does here.

Member:

I think we used to use it for the "user data directory" for Chromium, but as I look in the code now, we just use tmpdir from the os package: https://github.com/elastic/kibana/blob/master/x-pack/plugins/reporting/server/browsers/chromium/driver_factory/index.ts#L51

@streamich (Contributor) left a comment:

Do we want to differentiate between "clustered" and "non-clustered" modes? Could we go for the Cloud-first approach and only have a "clustered" mode? If somebody does not need it (say, in a Docker container), they just spin up a single Kibana worker.

@lukeelmers (Member) commented Jun 8, 2021

> Do we want to differentiate between "clustered" and "non-clustered" modes?

@streamich The current proposal is to not differentiate between the two modes, so that (outside of the coordinator process) the rest of Kibana doesn't need to be aware of the mode it is running in. It just needs to know whether it is the "main" worker or not (which would always return true in non-cluster mode).

> Could we go for the Cloud-first approach and only have "clustered" mode, if somebody does not need it (say a Docker container), they just spin up a single Kibana worker.

Yes, the idea is to get to a place where we are using clustering by default based on the number of CPUs available (currently this is Phase 3 in the proposed rollout). Then users who don't want it can opt out by modifying the config. It's still up for debate whether we make it the default from the beginning, or have a period of time where it is opt-in.

EDIT: Realizing now I think you're asking the same question as Josh asked here -- see that thread for more discussion.
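
A rough sketch of what that "main worker" check could look like; the function name and the "first worker is main" convention are assumptions, not a defined core API.

```ts
import cluster from 'cluster';

export function isMainWorker(): boolean {
  // Non-cluster mode: there is only one process, so it is always the "main" one.
  if (!cluster.isWorker) {
    return true;
  }
  // Cluster mode: designate a single worker as "main" (here, by convention, the
  // first forked worker); the real election mechanism is left open by the RFC.
  return cluster.worker!.id === 1;
}
```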


An example would be Reporting's queueFactory pooling. As we want to only be running a single headless at a time per
Kibana instance, only one worker should have pooling enabled.

Member:

> we want to only be running a single headless at a time per Kibana instance

If we didn't do anything, the number of Chromium processes running would always be <= the number of cores. That seems OK to me.

@joshdover (Contributor):

From a Core perspective, I think this RFC is ✅ once it's been updated based on the most recent discussions above.

@lukeelmers (Member):

@joshdover @pgayvallet @tsullivan @streamich @mikecote, et al: I've pushed a batch of updates based on the last round of feedback. Please take a look and let me know if anything important is missing.

@lukeelmers (Member):

Okay folks, I'm going to go ahead and move this into a final comment period. If you have any major concerns/objections, please speak up by the end of next week (Friday 25 June, 2021).

@lukeelmers added the RFC/final-comment-period label (If no concerns are raised in 3 business days this RFC will be accepted) Jun 17, 2021
@pgayvallet (Contributor, Author) left a comment:

(Can't approve because I'm still the initial author of the PR, but) LGTM

Labels
  • backport:skip (This commit does not require backporting)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • RFC/final-comment-period (If no concerns are raised in 3 business days this RFC will be accepted)
  • RFC
  • Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
  • v8.0.0