Understanding Field capabilities API performance limitations #76509

Closed
mattkime opened this issue Aug 13, 2021 · 14 comments
Labels
:Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team)

Comments

@mattkime

mattkime commented Aug 13, 2021

Since 7.11, Kibana no longer caches field caps API responses for index patterns (elastic/kibana#82223); the field caps API is called whenever a Kibana index pattern is loaded, such as on page load or when navigating between Kibana apps.

This works well for the vast majority of our users, but there are cases where the field caps response can take >20s. This seems to occur when a large number (thousands) of indices are matched, there are a large number (thousands) of fields, and RBAC is in use. When the response takes too long, Kibana appears to be broken, and the requests can overwhelm a cluster. We consider <1s responses to be performant.
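
For reference, this is roughly the kind of call involved (a minimal sketch using Python's `requests` library against a local cluster and an assumed `logs-*` pattern; the exact parameters Kibana sends may differ):

```python
import requests

# Hypothetical field caps request; the host and index pattern are assumptions.
resp = requests.get(
    "http://localhost:9200/logs-*/_field_caps",
    params={"fields": "*"},  # ask for capabilities of every field
    timeout=30,
)
fields = resp.json()["fields"]
print(f"{len(fields)} fields reported across all matching indices")
```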

We need someone from the ES team to create a model that describes the performance impact of each of these factors (index count, field count, RBAC) so that we can resolve these problems and prevent them from recurring.


Original attempt at this discussion: #59581
Real-world Kibana use case that drove the decision not to cache the field caps response: elastic/kibana#71787 (comment)

@original-brownbear
Member

I think the most relevant issue here is #74648, which, if implemented, would massively reduce the request count in the problematic cases, thus also reducing auth overhead. I think we can close this issue since we have a path forward for a fix in that one?

@mattkime
Author

mattkime commented Aug 15, 2021

@original-brownbear

Thanks, that's extremely helpful and gives me some optimism.

I'd prefer to keep this issue open, as I'd benefit from having a central location where this problem is discussed. I've seen multiple cases where Kibana isn't working due to slow field caps responses, and it seems there has been a reasonable amount of activity around resolving this problem that I've been completely unaware of.

@deckkh

deckkh commented Aug 15, 2021

We have been hit hard by this issue since we upgraded to 7.11.2 back in June. We had a customer with one cluster which had created an index pattern matching some 6000+ indices. Frequently, whenever the customer accessed this index pattern, we would see nodes leave the cluster, and in the audit logs we would see a huge spike of FieldCapabilitiesIndexRequest entries during the crash. That has led us to ask the customer to redesign their index strategy, but that is not easily done and takes a large amount of time. I can provide the Elastic support case ID if you are interested.

@mayya-sharipova mayya-sharipova added the :Search/Search label Aug 16, 2021
@elasticmachine elasticmachine added the Team:Search label Aug 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani
Contributor

I've been intending to look into #74648 for some time -- I'll bump up its priority given the problems we've been facing.

@jtibshirani
Contributor

jtibshirani commented Aug 26, 2021

In response to @mattkime's request for a model, here's our understanding of the cause of the slowness:

  1. To gather the field caps, the coordinator creates a separate request per index. When RBAC is enabled, each of these index requests needs to be authorized, and an RBAC authorization check can be expensive (a known issue, tracked in #67987). Even without RBAC, there is overhead to having so many individual requests (for example, parsing and serialization).
  2. The coordinator must then merge every individual index response, which can take substantial time and memory when there are a large number of indices. (A rough sketch of this flow follows below.)
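
To make the shape of the problem concrete, here is a minimal sketch in plain Python (not Elasticsearch code; `authorize` and `fetch_index_caps` are hypothetical stand-ins for the per-index RBAC check and the per-index transport request):

```python
from collections import defaultdict

def field_caps_per_index(indices, authorize, fetch_index_caps):
    """Current model: one authorized request per index, with all merging on the coordinator."""
    merged = defaultdict(dict)  # field name -> {field type: capabilities}
    for index in indices:
        authorize(index)                    # per-index RBAC cost (see #67987)
        response = fetch_index_caps(index)  # per-index parse/serialize overhead
        for field, caps in response.items():
            merged[field][caps["type"]] = caps  # coordinator pays the full merge cost
    return merged
```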

I think we need to look into both sources of slowness to resolve the issue. Here's a proposal:

  • Move to a node-centric model, to greatly reduce the number of index requests. We'd like to tackle this even if we separately speed up RBAC requests. (#74648)
  • Look into merging the responses on each node, so the coordinator doesn't have to do all of the merging work (sketched below). (#82879)
  • Maybe we could also merge the responses on the remote cluster in the case of CCS. (#78665)
  • Investigate if we can detect when two indices share the same mappings to cut down on duplicate work. I suspect that when users have tons of indices, these are often time-based and mostly share the same mappings (maybe they're even part of the same datastream).
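
As a rough illustration of the node-centric idea in the first two bullets (plain Python again; `node_for_index` and `fetch_node_caps` are hypothetical stand-ins for shard routing and a per-node request whose per-index responses are already merged on the data node):

```python
from collections import defaultdict

def field_caps_per_node(indices, node_for_index, fetch_node_caps):
    """Proposed model: one request per data node, with per-index merging done on that node."""
    by_node = defaultdict(list)
    for index in indices:
        by_node[node_for_index(index)].append(index)

    merged = {}
    for node, node_indices in by_node.items():
        node_response = fetch_node_caps(node, node_indices)  # one transport round trip per node
        for field, caps_by_type in node_response.items():
            merged.setdefault(field, {}).update(caps_by_type)  # much smaller final merge
    return merged
```

With thousands of indices spread across a few dozen nodes, the request count (and with it the per-request auth overhead) drops from thousands to dozens.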

For context, we've already merged changes to prevent the slow field caps requests from causing cluster instability; they will be available in 7.14.1 and 7.15. I think we aren't planning more changes from a stability standpoint (@original-brownbear, feel free to correct this).

@henrikno
Contributor

henrikno commented Oct 2, 2021

Investigate if we can detect when two indices share the same mappings to cut down on duplicate work. I suspect that when users have tons of indices, these are often time-based and mostly share the same mappings (maybe they're even part of the same datastream).

I think this is going to be a big improvement. We've noticed this in particular on indices that use ECS, such as APM. ECS contains roughly 1200 fields, and with ILM rollover it creates quite a few indices (in particular, it rolls over empty indices for older versions, similar to https://github.com/elastic/elasticsearch/issues/733490), so it adds up quickly. But the mappings are 99% the same.
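
As a back-of-the-envelope illustration of why it adds up (the counts below are rough figures taken from this thread, not measurements):

```python
fields_per_index = 1200    # approximate ECS field count mentioned above
matching_indices = 6000    # order of magnitude reported earlier in this thread
print(fields_per_index * matching_indices)  # 7,200,000 field entries to gather and merge,
                                            # even though most mappings are near-identical
```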

@javanna
Member

javanna commented Jan 17, 2022

@jtibshirani do you think we should open a spin-off meta issue that describes the plan you listed above?

@jtibshirani
Contributor

jtibshirani commented Jan 18, 2022

@javanna this is a good question. I think we could track this work under our shard scalability efforts: #77466. @original-brownbear -- are there any unresolved items here that you think we should add to the shard scalability meta issue? If your benchmarks haven't shown that a change is important, we could consider those as "won't fix" for now. (We could still keep #78665 separate, since that's specific to the CCS case.)

@original-brownbear
Member

@jtibshirani

are there any unresolved items here that you think we should add to the shard scalability meta issue?

A field caps request still consumes considerable memory when executed across a large number of indices. Also, for nodes holding a lot of indices with large mappings, the transport messages shipped around are still quite sizeable. As far as I understand it, we could improve both of these issues by doing the aggregation on the data nodes instead of aggregating everything on the coordinating node? If that's still a reasonable plan, should I add an issue for it?

I don't think this is something to act on in the next couple of weeks, because at the moment cluster size is still practically gated at values that keep field caps response times under 10s, but I wouldn't consider this "won't fix".

@original-brownbear
Member

I should also point out that

Investigate if we can detect when two indices share the same mappings to cut down on duplicate work.

is now relatively trivial to do. We track the sha256 of each mapping (hashed via a relatively stable algorithm) and deduplicate MappingMetadata instances in the cluster state; see org.elasticsearch.cluster.metadata.Metadata#getMappingsByHash for an example. This makes detecting whether two indices share the same mapping as cheap as comparing two strings.
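
A minimal sketch of how that could be used to cut down the duplicate work (plain Python; `mapping_hash_of` and `compute_field_caps` are hypothetical stand-ins for the cluster-state hash lookup and the per-mapping field caps computation):

```python
def field_caps_with_mapping_dedup(indices, mapping_hash_of, compute_field_caps):
    caps_by_hash = {}
    result = {}
    for index in indices:
        h = mapping_hash_of(index)  # sha256 of the mapping, already tracked in the cluster state
        if h not in caps_by_hash:
            caps_by_hash[h] = compute_field_caps(index)  # expensive work done once per unique mapping
        result[index] = caps_by_hash[h]  # indices with identical mappings reuse the result
    return result
```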

@jtibshirani
Contributor

jtibshirani commented Jan 19, 2022

@original-brownbear sounds good to add an issue for merging responses, thanks! Feel free to ping the search team when you'd like us to pick up work. If everything is tracked in #77466 then I'll close this out.

@jtibshirani
Contributor

Oops, I commented at the same time as you. Good to know about the mapping hash. I think we could put that to use when collecting and merging responses.

@jtibshirani
Contributor

Closing in favor of these issues:

The first one is tracked in this meta issue: #77466.
