
[RFC] Node.js clustering in Kibana #94057

Merged 19 commits on Jun 28, 2021

Conversation

@pgayvallet (Contributor) commented Mar 9, 2021

Summary

This RFC(-ish) describes the proposed implementation and the required code changes to have Node.js clustering available in Kibana.

Original issue: #68626
POC PR: #93380

Rendered RFC: https://github.com/pgayvallet/kibana/blob/kbn-rfc-clustering/rfcs/text/0020_nodejs_clustering.md

@pgayvallet added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v8.0.0, release_note:skip (Skip the PR/issue when compiling release notes), and RFC labels Mar 9, 2021
@pgayvallet marked this pull request as ready for review March 9, 2021 09:53
@elasticmachine (Contributor)

Pinging @elastic/kibana-core (Team:Core)


![image](../images/15_clustering/perf-4-workers.png)

### Analysis
Contributor:

Maybe you could explain a bit more how our benchmark tests work? My understanding is that they perform a single flow and measure the response times for completing that flow. This is a good smoke test to see what performance a single user would observe on a mostly idle system. However, does our benchmarking run enough parallel requests to start maxing out a single core, let alone multiple cores?

Unless our benchmark is causing the node process to consume so much CPU that we start seeing delays in the event-loop (and some requests start hitting a timeout) we're unlikely to see the true performance gain clustering could achieve.

Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure the requests-per-second throughput.

@pgayvallet (Contributor, Author) Mar 9, 2021:

> Although faster response time is a good sign that Kibana is able to handle more, we probably want to measure the requests-per-second throughput.

This is not how Gatling works, from what I understand. The best we can do would be to increase the number of requests performed per batch and observe the difference in response time.

cc @dmlemeshko wdyt?

Member:

From the posted screenshot I think @pgayvallet used the DemoJourney simulation, which runs a sequence of API calls for each virtual user:

  • keeping 20 concurrent users for the first 3 minutes
  • increasing from 20 to 50 concurrent users within the next 3 minutes

Gatling spins up virtual users based on the simulation and tracks the request time for each API call. It has no APM-like functionality. The full HTML report has other charts that could give more information, but not Kibana req/sec throughput.

Running 200 concurrent users should give a clearer diff in the reports.

Contributor:

I don't know anything about our setup, but it seems like Gatling should be able to give us requests succeeded vs requests failed (OK vs KO): https://gatling.io/docs/current/general/reports/

If we can get this metric, we should push concurrent users up until we start seeing at least some failures (maybe 1-5%).

We would then have to double the concurrent users when trying with two workers and compare the number of successful requests against the one-worker scenario.
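
As a rough illustration of that kind of throughput measurement (outside of Gatling), a minimal Node.js probe could look like the sketch below; the endpoint, header, and the use of the autocannon package are assumptions, not part of the benchmark setup described in this thread.

```ts
// Hypothetical throughput probe, not the actual benchmark setup: push
// `connections` up until errors/non-2xx reach ~1-5%, then compare requests/sec
// between 1-worker and multi-worker runs.
import autocannon from 'autocannon';

async function probe(connections: number) {
  const result = await autocannon({
    url: 'http://localhost:5601/api/status', // assumed local Kibana endpoint
    connections, // roughly "concurrent users"
    duration: 60, // seconds
    headers: { 'kbn-xsrf': 'true' },
  });
  console.log({
    connections,
    rps: result.requests.average, // requests per second
    non2xx: result.non2xx,
    errors: result.errors,
  });
}

probe(200).catch(console.error);
```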

Contributor:

FWIW, when doing performance testing for Fleet, I had to do what @rudolf suggested and manually increase the concurrent users to find the breaking point.

any sense to have the coordinator recreate them indefinitely, as the error requires manual intervention.

Should we try to distinguish recoverable and non-recoverable errors, or are we good terminating the main Kibana process
when any worker terminates unexpectedly for any reason (After all, this is already the behavior in non-cluster mode)?
Contributor:

The advantage of having the coordinator restart the worker is that it's much faster to recover from an unhandled exception than restarting all of Kibana (and doing config validation, migration checks, etc.).

However, it feels like this could be a second phase, and we can start by simply killing all workers if one throws an unhandled exception, since this should be rare.

Member:

In my experience, there are different types of events, and they need to be handled from both ends:

  • Sometimes, the IPC channel is closed due to the host being overloaded. These usually auto-heal, but if the coordinator exits during that disconnection, the worker won't be stopped and we'll get a zombie process. If I'm recalling correctly, that's a disconnect event that needs to be handled by the worker so it can kill itself.
  • On exit, there's the exit code and the exitedAfterDisconnect flag, which could help with identifying whether it's a broken process or an intentional shutdown.

But I think it's worth considering what @rudolf says, probably in conjunction with graceful shutdowns: i.e. when one worker fails, we send the kill signal so all the other workers stop gracefully before being killed.
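
For context, the kind of handling being discussed maps onto Node's cluster events roughly like this; a sketch only (using Node 16+ naming such as isPrimary/exitedAfterDisconnect), not the RFC's final design.

```ts
import cluster from 'cluster';

if (cluster.isPrimary) {
  cluster.fork();

  cluster.on('exit', (worker, code, signal) => {
    // exitedAfterDisconnect distinguishes intentional shutdowns (kill()/disconnect())
    // from crashes; code/signal describe how the process ended.
    if (worker.exitedAfterDisconnect) {
      return; // intentional, nothing to do
    }
    // Proposal discussed here: treat any unexpected worker exit as fatal and stop
    // the whole Kibana process, mirroring today's non-cluster behavior.
    console.error(`worker ${worker.process.pid} died (code=${code}, signal=${signal})`);
    for (const other of Object.values(cluster.workers ?? {})) {
      other?.kill(); // let the remaining workers shut down before exiting
    }
    process.exit(1);
  });
} else {
  // Worker side: if the IPC channel to the coordinator is lost and never comes
  // back, exit instead of lingering as a zombie process.
  process.on('disconnect', () => process.exit(1));
}
```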

Member:

The current proposal is to kill all processes as Rudolf describes, with worker restarts happening in a follow-up phase, after we make a plan for identifying recoverable vs non-recoverable errors.

@tylersmalley changed the title [RFC] NodeJS clustering in Kibana → [RFC] Node.js clustering in Kibana Mar 9, 2021

Notes:
- What should be the default value for `clustering.workers`? We could go with `Max(1, os.cpus().length - 1)`, but do we really want to use all cpus by default,
knowing that every worker is going to have its own memory usage.
Member:

Based on the benchmarks, do nodes use less memory when they split the load?

Contributor:

++ to this question. Also curious what Elasticsearch does here; it's possible we can follow a similar pattern. If we can minimize memory overhead (see comment above), it'd be great to have a sensible automatic scaling default.

Contributor:

The process's RSS memory will grow with the number of open requests. So, theoretically, for a given number of requests per second, more workers will each require less memory. However, the garbage collector might not collect as aggressively if the RSS is below the maximum heap. And even if there's a trend in our benchmarks, it doesn't mean there won't be spikes: if a given worker handles a request to export a large enough number of saved objects, that worker will consume all of its heap, and adding more workers won't improve this.

So in practice I think we should ignore any memory benchmarks and set the expectation with users that each worker will use up to --max-old-space-size (or the default for their system).
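
To illustrate that expectation, here is a sketch under assumed values (not an actual Kibana change; Node 16+ naming, older versions use isMaster/setupMaster): each forked worker is its own Node.js process with its own V8 heap, so any heap cap applies per worker and the theoretical ceiling scales with the worker count.

```ts
import cluster from 'cluster';

if (cluster.isPrimary) {
  // Hypothetical: cap each worker's old space at 1 GB; with N workers the
  // theoretical heap ceiling is roughly N * 1 GB (plus per-process RSS overhead).
  const maxOldSpaceMb = 1024;
  cluster.setupPrimary({ execArgv: [`--max-old-space-size=${maxOldSpaceMb}`] });
  for (let i = 0; i < 2; i++) {
    cluster.fork();
  }
} else {
  // Each worker reports its own memory usage independently of its siblings.
  const { heapUsed, rss } = process.memoryUsage();
  console.log(`worker ${process.pid}: heapUsed=${heapUsed} rss=${rss}`);
}
```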

Contributor:

> do we really want to use all cpus by default

It feels like there really isn't a good way to automatically choose the correct value here. Perhaps we should just always require that the user sets this in configuration if they want clustering enabled?
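
For reference, the default under debate and the alternative of requiring an explicit value could be sketched as follows; the config shape is illustrative only, not the actual Kibana schema.

```ts
import os from 'os';

// The default quoted from the RFC text, expressed in code:
const defaultWorkerCount = Math.max(1, os.cpus().length - 1);

// Hypothetical config shape, not the real Kibana config schema.
interface ClusteringConfig {
  enabled: boolean;
  workers?: number;
}

function resolveWorkerCount(config: ClusteringConfig): number {
  if (!config.enabled) return 1;
  // Option A (RFC draft): fall back to "all cores minus one".
  // Option B (this thread): throw instead, forcing users to set it explicitly.
  return config.workers ?? defaultWorkerCount;
}
```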

- Implementation cost is going to be significant, both in core and in plugins. Also, this will have to be a collaborative
effort, as we can't enable the clustered mode in production until all the identified breaking changes have been addressed.

- Even if easier to deploy, it doesn't really provide anything more than a multi-instances Kibana setup.
Contributor:

Do we have any side-by-side comparisons of the performance of the two modes?

Contributor:

> it doesn't really provide anything more than a multi-instances Kibana setup.

Kibana does a lot of background polling work, which creates a significant amount of network traffic. If we centralize all this background polling to a single worker, we can reduce this background work by a factor equal to the number of worker processes (maybe that would be a factor of 8 or 16 for most deployments?).

It's probably worth trying to quantify the impact, e.g. reducing 1 KB of traffic per month by a factor of 16 is negligible, but reducing 10 GB by a factor of 16 would be a big win.

Contributor (author):

> If we centralize all this background polling to a single worker we can reduce this background work by a factor equal to the number of worker processes

That sounds great, but it would also require more changes from plugins.

E.g. we could have only one worker perform the license check calls and then broadcast the result to the other workers, but that means changing the licensing plugin implementation.

All these optimizations can easily be done later as follow-ups, though.

> It's probably worth trying to quantify the impact

We would need to list all the 'background' requests we are performing against ES. Not sure how to do that without help from the other teams, though.
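
To make the scale of such a change concrete, a "poll once, broadcast to all workers" pattern could look roughly like the sketch below; the message shape and the fetch helper are hypothetical, not an existing Kibana or licensing-plugin API.

```ts
import cluster from 'cluster';

interface LicenseMessage {
  type: 'license:updated';
  payload: unknown;
}

async function fetchLicenseFromEs(): Promise<unknown> {
  // Placeholder: in reality this would call the Elasticsearch license API.
  return { type: 'basic' };
}

if (cluster.isPrimary) {
  const workers = [cluster.fork(), cluster.fork()];

  // Only one process polls Elasticsearch...
  setInterval(async () => {
    const msg: LicenseMessage = { type: 'license:updated', payload: await fetchLicenseFromEs() };
    // ...and the result is broadcast to every worker over the IPC channel.
    workers.forEach((worker) => worker.send(msg));
  }, 30_000);
} else {
  // Workers consume the broadcast instead of each polling ES themselves.
  process.on('message', (message) => {
    const msg = message as LicenseMessage;
    if (msg.type === 'license:updated') {
      // update the local license cache / observable here
    }
  });
}
```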

Contributor:

Agreed, but I think it's worth adding this as a benefit of clustering.

Member:

This is listed as a potential future benefit of clustering, and the current RFC doesn't preclude us from doing this. However, the current proposal is to treat it as an enhancement and not address it in the initial implementation.


- Between non-clustered and 2-worker cluster mode, we observe a 20-25% gain in the 50th percentile response time.
Gains for the 75th and 95th percentiles are between 10% and 40%.
- Between 2-worker and 4-worker cluster mode, the gain on the 50th is negligible, but the 75th and the 95th are
Contributor:

It might be interesting to compare with a setup containing several Kibana instances behind a load balancer.

@LeeDr (Contributor) commented Mar 17, 2021

@dmlemeshko please correct me if I misspeak here, as I'm just referring to your work.

One testing issue we're still working on is getting Elasticsearch and Kibana monitoring data while running the Gatling tests. Some early results on Cloud seemed to indicate that Elasticsearch was the bottleneck once we hit a certain number of users with Dima's demo scenario. Once the search queue builds up and we start getting searches rejected, there's nothing Kibana can do to support more users.
Also note that in those tests there was NO INGESTION of data into Elasticsearch. It was purely queries against a small sample data set. Adding ingestion would have reduced the throughput of Kibana even more.

I'm sure there are ways to scale Elasticsearch up to handle more requests/sec from Kibana, but it wasn't as simple as doubling the number of Elasticsearch nodes.

Running a similar Gatling test on a Jenkins machine with Elasticsearch and Kibana on the same host had significantly higher throughput (lower latency, maybe a higher-performance machine). But I'm still not sure whether Elasticsearch or Kibana is the limiting factor in this case. I think we really need a load test case where we know Kibana is the bottleneck by a good margin, and then compare Kibana with/without clustering.

The Read Threads chart below shows the Search Rejections hit about 245 (at 10-second intervals) during a Gatling load test:
[screenshot: Read Threads chart showing search rejections]

Here's a visualization of the Gatling data during the same run. In this case I'm not seeing the response times get notably worse corresponding to the search rejections, but I think that's because Dima backed the number of users down. I think when he added more users you could see an obvious impact.
[screenshot: Gatling response-time visualization]

@joshdover mentioned this pull request Mar 23, 2021

One pragmatic solution could be, when clustering is enabled, to create a sub folder under path.data for each worker.

The data folder is not considered part of our public API, and the implementation and path already changed in previous
Contributor:

Really? How did it change in previous minor releases?

Contributor (author):

This statement is coming from @joshdover; I'll let him answer if he remembers.

Contributor:

I'm not sure what I said, and I'm having a hard time finding an instance where we changed this. I must have misspoken!

However, I don't believe we consider the structure of the data folder to be part of our public API. We should be able to use the sub-folder approach if needed.
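
A minimal sketch of that sub-folder approach (hypothetical naming, not the RFC's final design) could look like this:

```ts
import { mkdirSync } from 'fs';
import { join } from 'path';
import cluster from 'cluster';

// Resolve the effective data path for the current process. In non-cluster mode
// (or in the coordinator) this is just path.data; in a worker it becomes a
// per-worker sub-folder such as data/worker-1.
function resolveDataPath(basePath: string): string {
  if (!cluster.isWorker) {
    return basePath;
  }
  const workerPath = join(basePath, `worker-${cluster.worker!.id}`);
  mkdirSync(workerPath, { recursive: true });
  return workerPath;
}

// Example: resolveDataPath('/var/lib/kibana') -> '/var/lib/kibana/worker-1' in worker 1.
```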

Contributor:

What all is stored in the data folder? Is it just the uuid file? The docs are not really clear...

> path.data: The path where Kibana stores persistent data not saved in Elasticsearch. Default: data

Source: https://www.elastic.co/guide/en/kibana/current/settings.html

And then they're contradicted by the package directory layout:

  • data: The location of the data files written to disk by Kibana and its plugins (default /var/lib/kibana, setting path.data)
  • logs: Log files location (setting path.logs)
  • plugins: Plugin files location. Each plugin will be contained in a subdirectory.

Source: https://www.elastic.co/guide/en/kibana/current/rpm.html#rpm-layout

Contributor:

The only plugin that I have found referencing this is the reporting plugin; however, I can't find a place in the code that actually reads the configuration. @elastic/kibana-reporting-services Could you provide any guidance on how the path.data config is used by the Reporting plugin? My guess is some sort of temporary storage, but I can't find any usages.

Member:

Pinging @elastic/kibana-app-services for any guidance on what reporting does here.

Member:

I think we used to use it for the "user data directory" for Chromium, but as I look in the code now, we just use tmpdir from the os package: https://github.com/elastic/kibana/blob/master/x-pack/plugins/reporting/server/browsers/chromium/driver_factory/index.ts#L51

@streamich (Contributor) left a comment:

Do we want to differentiate between "clustered" and "non-clustered" modes? Could we go for the Cloud-first approach and only have a "clustered" mode? If somebody does not need it (say, in a Docker container), they just spin up a single Kibana worker.

@lukeelmers (Member) commented Jun 8, 2021

> Do we want to differentiate between "clustered" and "non-clustered" modes?

@streamich The current proposal is to not differentiate between the two modes, so that (outside of the coordinator process) the rest of Kibana doesn't need to be aware of the mode it is running in. It just needs to know whether it is the "main" worker or not (which would always return true in non-cluster mode).

> Could we go for the Cloud-first approach and only have "clustered" mode, if somebody does not need it (say a Docker container), they just spin up a single Kibana worker.

Yes, the idea is to get to a place where we are using clustering by default based on the number of CPUs available (currently this is Phase 3 in the proposed rollout). Then users who don't want it can opt out by modifying the config. It's still up for debate whether we make it the default from the beginning, or have a period of time where it is opt-in.

EDIT: Realizing now I think you're asking the same question as Josh asked here -- see that thread for more discussion.
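
A rough sketch of what that "main worker" check could look like; the function name and the "first worker is main" convention are assumptions, not a defined core API.

```ts
import cluster from 'cluster';

export function isMainWorker(): boolean {
  // Non-cluster mode: there is only one process, so it is always the "main" one.
  if (!cluster.isWorker) {
    return true;
  }
  // Cluster mode: designate a single worker as "main" (here, by convention, the
  // first forked worker); the real election mechanism is left open by the RFC.
  return cluster.worker!.id === 1;
}
```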


An example would be Reporting's queueFactory pooling. As we want to only be running a single headless at a time per
Kibana instance, only one worker should have pooling enabled.

Member:

> we want to only be running a single headless at a time per Kibana instance

If we didn't do anything, the number of Chromium processes running would always be <= the number of cores. That seems OK to me.

@joshdover (Contributor):

From a Core perspective, I think this RFC is ✅ once it's been updated based on the most recent discussions above.

@lukeelmers (Member):

@joshdover @pgayvallet @tsullivan @streamich @mikecote, et al: I've pushed a batch of updates based on the last round of feedback. Please take a look and let me know if anything important is missing.

@lukeelmers (Member):

Okay folks, I'm going to go ahead and move this into a final comment period. If you have any major concerns/objections, please speak up by the end of next week (Friday 25 June, 2021).

@lukeelmers added the RFC/final-comment-period label (If no concerns are raised in 3 business days this RFC will be accepted) Jun 17, 2021
@pgayvallet (Contributor, Author) left a comment:

(Can't approve because I'm still the initial author of the PR, but) LGTM

Labels
  • backport:skip (This commit does not require backporting)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • RFC/final-comment-period (If no concerns are raised in 3 business days this RFC will be accepted)
  • RFC
  • Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
  • v8.0.0