Understanding Field capabilities API performance limitations #76509

Closed
mattkime opened this issue Aug 13, 2021 · 14 comments
Labels
:Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team)

Comments

@mattkime

mattkime commented Aug 13, 2021

Since 7.11, Kibana no longer caches field caps API responses for index patterns (elastic/kibana#82223); the field caps API is called whenever a Kibana index pattern is loaded, such as on page load or when navigating between Kibana apps.

This works well for the vast majority of our users, but there are cases where the field caps response can take >20s. This seems to occur when a large number (thousands) of indices are matched, there are a large number (thousands) of fields, and RBAC is in use. When the response takes too long, Kibana appears to be broken, and the requests can overwhelm a cluster. We consider <1s responses to be performant.
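
For reference, this is roughly the kind of call involved (a minimal sketch using Python's `requests` library against a local cluster and an assumed `logs-*` pattern; the exact parameters Kibana sends may differ):

```python
import requests

# Hypothetical field caps request; the host and index pattern are assumptions.
resp = requests.get(
    "http://localhost:9200/logs-*/_field_caps",
    params={"fields": "*"},  # ask for capabilities of every field
    timeout=30,
)
fields = resp.json()["fields"]
print(f"{len(fields)} fields reported across all matching indices")
```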

We need someone from the ES team to create a model that describes the performance impact of each of these factors (index count, field count, RBAC) so that we can resolve these problems and prevent them from recurring.


Original attempt at this discussion: #59581
Real-world Kibana use case that drove the decision not to cache the field caps response: elastic/kibana#71787 (comment)

@original-brownbear
Member

I think the most relevant issue here is #74648, which, if implemented, would massively reduce the request count in the problematic cases, thus also reducing auth overhead. I think we can close this issue since we have a path forward for a fix in that one?

@mattkime
Author

mattkime commented Aug 15, 2021

@original-brownbear

Thanks, that's extremely helpful and gives me some optimism.

I'd prefer to keep this issue open, as I'd benefit from having a central location where this problem is discussed. I've seen multiple cases where Kibana isn't working due to slow field caps responses, and it seems there has been a reasonable amount of activity around resolving this problem that I've been completely unaware of.

@deckkh

deckkh commented Aug 15, 2021

We have been hit hard by this issue since we upgraded to 7.11.2 back in June. We had a customer with one cluster which had created an index pattern matching some 6000+ indices. Frequently, whenever the customer accessed this index pattern, we would see nodes leave the cluster, and in the audit logs we would see a huge spike of FieldCapabilitiesIndexRequest entries during the crash. That has led us to ask the customer to redesign their index strategy, but that is not easily done and takes a large amount of time. I can provide the Elastic support case ID if you are interested.

@mayya-sharipova mayya-sharipova added the :Search/Search label Aug 16, 2021
@elasticmachine elasticmachine added the Team:Search label Aug 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani
Contributor

I've been intending to look into #74648 for some time -- I'll bump up its priority given the problems we've been facing.

@jtibshirani
Contributor

jtibshirani commented Aug 26, 2021

In response to @mattkime's request for a model, here's our understanding of the cause of the slowness:

  1. To gather the field caps, the coordinator creates a separate request per index. When RBAC is enabled, each of these index requests needs to be authorized, and an RBAC authorization check can be expensive (a known issue, tracked in #67987). Even without RBAC, there is overhead to having so many individual requests (for example, parsing and serialization).
  2. The coordinator must then merge every individual index response, which can take substantial time and memory when there are a large number of indices. (A rough sketch of this flow follows below.)
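
To make the shape of the problem concrete, here is a minimal sketch in plain Python (not Elasticsearch code; `authorize` and `fetch_index_caps` are hypothetical stand-ins for the per-index RBAC check and the per-index transport request):

```python
from collections import defaultdict

def field_caps_per_index(indices, authorize, fetch_index_caps):
    """Current model: one authorized request per index, with all merging on the coordinator."""
    merged = defaultdict(dict)  # field name -> {field type: capabilities}
    for index in indices:
        authorize(index)                    # per-index RBAC cost (see #67987)
        response = fetch_index_caps(index)  # per-index parse/serialize overhead
        for field, caps in response.items():
            merged[field][caps["type"]] = caps  # coordinator pays the full merge cost
    return merged
```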

I think we need to look into both sources of slowness to resolve the issue. Here's a proposal:

  • Move to a node-centric model, to greatly reduce the number of index requests. We'd like to tackle this even if we separately speed up RBAC requests. (#74648)
  • Look into merging the responses on each node, so the coordinator doesn't have to do all of the merging work (sketched below). (#82879)
  • Maybe we could also merge the responses on the remote cluster in the case of CCS. (#78665)
  • Investigate if we can detect when two indices share the same mappings to cut down on duplicate work. I suspect that when users have tons of indices, these are often time-based and mostly share the same mappings (maybe they're even part of the same datastream).
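
As a rough illustration of the node-centric idea in the first two bullets (plain Python again; `node_for_index` and `fetch_node_caps` are hypothetical stand-ins for shard routing and a per-node request whose per-index responses are already merged on the data node):

```python
from collections import defaultdict

def field_caps_per_node(indices, node_for_index, fetch_node_caps):
    """Proposed model: one request per data node, with per-index merging done on that node."""
    by_node = defaultdict(list)
    for index in indices:
        by_node[node_for_index(index)].append(index)

    merged = {}
    for node, node_indices in by_node.items():
        node_response = fetch_node_caps(node, node_indices)  # one transport round trip per node
        for field, caps_by_type in node_response.items():
            merged.setdefault(field, {}).update(caps_by_type)  # much smaller final merge
    return merged
```

With thousands of indices spread across a few dozen nodes, the request count (and with it the per-request auth overhead) drops from thousands to dozens.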

For context, we've already merged changes to prevent the slow field caps requests from causing cluster instability; they will be available in 7.14.1 and 7.15. I think we aren't planning more changes from a stability standpoint (@original-brownbear, feel free to correct this).

@henrikno
Contributor

henrikno commented Oct 2, 2021

Investigate if we can detect when two indices share the same mappings to cut down on duplicate work. I suspect that when users have tons of indices, these are often time-based and mostly share the same mappings (maybe they're even part of the same datastream).

I think this is going to be a big improvement. We've noticed this in particular on indices that use ECS, such as APM. ECS contains roughly 1200 fields, and with ILM rollover it creates quite a few indices (in particular, it rolls over empty indices for older versions, similar to https://github.com/elastic/elasticsearch/issues/733490), so it adds up quickly. But the mappings are 99% the same.
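
As a back-of-the-envelope illustration of why it adds up (the counts below are rough figures taken from this thread, not measurements):

```python
fields_per_index = 1200    # approximate ECS field count mentioned above
matching_indices = 6000    # order of magnitude reported earlier in this thread
print(fields_per_index * matching_indices)  # 7,200,000 field entries to gather and merge,
                                            # even though most mappings are near-identical
```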

@javanna
Member

javanna commented Jan 17, 2022

@jtibshirani do you think we should open a spin-off meta issue that describes the plan you listed above?

@jtibshirani
Contributor

jtibshirani commented Jan 18, 2022

@javanna this is a good question. I think we could track this work under our shard scalability efforts: #77466. @original-brownbear -- are there any unresolved items here that you think we should add to the shard scalability meta issue? If your benchmarks haven't shown that a change is important, we could consider those as "won't fix" for now. (We could still keep #78665 separate, since that's specific to the CCS case.)

@original-brownbear
Member

@jtibshirani

are there any unresolved items here that you think we should add to the shard scalability meta issue?

A field caps request still consumes considerable memory when executed across a large number of indices. Also, for nodes holding a lot of indices with large mappings, the transport messages shipped around are still quite sizeable. As far as I understand it, we could improve both of these issues by doing the aggregation on the data nodes instead of aggregating everything on the coordinating node? If that's still a reasonable plan, should I add an issue for it?

I don't think this is something to act on in the next couple of weeks, because at the moment cluster size is still practically gated at values that keep field caps response times under 10s, but I wouldn't consider this "won't fix".

@original-brownbear
Member

I should also point out that

Investigate if we can detect when two indices share the same mappings to cut down on duplicate work.

is now relatively trivial to do. We track the sha256 of each mapping (hashed via a relatively stable algorithm) and deduplicate MappingMetadata instances in the cluster state; see org.elasticsearch.cluster.metadata.Metadata#getMappingsByHash for an example. This makes detecting whether two indices share the same mapping as cheap as comparing two strings.
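
A minimal sketch of how that could be used to cut down the duplicate work (plain Python; `mapping_hash_of` and `compute_field_caps` are hypothetical stand-ins for the cluster-state hash lookup and the per-mapping field caps computation):

```python
def field_caps_with_mapping_dedup(indices, mapping_hash_of, compute_field_caps):
    caps_by_hash = {}
    result = {}
    for index in indices:
        h = mapping_hash_of(index)  # sha256 of the mapping, already tracked in the cluster state
        if h not in caps_by_hash:
            caps_by_hash[h] = compute_field_caps(index)  # expensive work done once per unique mapping
        result[index] = caps_by_hash[h]  # indices with identical mappings reuse the result
    return result
```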

@jtibshirani
Contributor

jtibshirani commented Jan 19, 2022

@original-brownbear sounds good to add an issue for merging responses, thanks! Feel free to ping the search team when you'd like us to pick up work. If everything is tracked in #77466 then I'll close this out.

@jtibshirani
Contributor

Oops, I commented at the same time as you. Good to know about the mapping hash. I think we could put that to use when collecting and merging responses.

@jtibshirani
Contributor

Closing in favor of these issues:

The first one is tracked in this meta issue: #77466.
