Add support to SavedObjects.find for fetching more than 10k objects #77961

Closed
joshdover opened this issue Sep 18, 2020 · 13 comments
Labels
Feature:Saved Objects NeededFor:Fleet Needed for Fleet Team NeededFor:Security Solution SIEM, Endpoint, Timeline, Analyzer, Cases Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@joshdover
Contributor

joshdover commented Sep 18, 2020

The current find API on the SavedObjectsRepository cannot page through large data sets because of the index.max_result_window setting in Elasticsearch, which defaults to 10,000 objects. This is starting to limit what plugins can build, and we now have several SO types that may have large numbers of objects (SIEM's exception lists come to mind).
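For illustration, the constraint can be expressed as a small check (a sketch; isPageReachable is a hypothetical helper, and the 10,000 constant mirrors the index.max_result_window default):

```typescript
// Elasticsearch rejects requests where from + size exceeds
// index.max_result_window (default 10,000).
const MAX_RESULT_WINDOW = 10_000;

function isPageReachable(
  page: number,
  perPage: number,
  maxResultWindow: number = MAX_RESULT_WINDOW
): boolean {
  const from = perPage * (page - 1);
  return from + perPage <= maxResultWindow;
}
```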

To alleviate this, we could add scroll support to SavedObjects. However, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default. Clients would need to be aware of this and handle it properly, and this failure mode may not surface during development.
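For illustration, a simplified, synchronous sketch of what a scroll-style "fetch everything" loop could look like; openScroll and nextBatch are hypothetical stand-ins for the real _search?scroll and _search/scroll calls:

```typescript
// Each batch carries the scroll cursor and a page of hits.
interface ScrollBatch<T> {
  scrollId: string;
  hits: T[];
}

// Drain a scroll until an empty batch comes back, aggregating all hits.
function fetchAll<T>(
  openScroll: () => ScrollBatch<T>,
  nextBatch: (scrollId: string) => ScrollBatch<T>
): T[] {
  const results: T[] = [];
  let { scrollId, hits } = openScroll();
  while (hits.length > 0) {
    results.push(...hits);
    ({ scrollId, hits } = nextBatch(scrollId));
  }
  return results;
}
```

In the real API each call would be async and could fail with an expired scroll ID, which is exactly the TTL caveat described above.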

Another option could be the _async_search API, but it is not available in OSS distributions.

This issue definitely needs further investigation, but I wanted to open it to start collecting use cases where this would be useful.

Related: #22636, #64715

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Saved Objects labels Sep 18, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@joshdover joshdover changed the title Add scrolling support to SavedObjects Add support to SavedObjects.find for fetching more than 10k objects Sep 18, 2020
@azasypkin
Member

Thanks for filing this issue, @joshdover.

I've just encountered this limitation in the scope of #72420 as well. In a nutshell: when an admin wants to bulk rotate the SO encryption key, we need to fetch and process all SOs (ideally in batches of configurable size, to balance the load), from all spaces, for all SO types that may have encrypted attributes. And these days we may have quite a lot of them (alerts, actions, Fleet-related SOs).

The fact that we also update some of the fetched results makes paging with only perPage/page even more complex, even when we have fewer than 10k SOs.

Do you happen to have any recommended workarounds for SO use cases like this? If not, is there anything we can help with to boost priority for this enhancement?

cc @elastic/kibana-security

@pgayvallet
Contributor

To alleviate this, we could add scrolling support to SavedObjects, however, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default.

This is obviously a very stupid option in terms of memory usage, but still asking: could using the scroll API internally in the SO repository be an option? When page + perPage > index.max_result_window, we could use the scroll API under the hood, fetch everything, and return the aggregated results.

scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default

This is configurable via the scroll option though, so if we expose a new scroll API on the SOR/SOC, we could just expose that too. But yeah, TTL handling would have to be done by the consumer anyway; there is not much we can do when _search/scroll is called with an expired scroll.

@jen-huang
Contributor

Hi Josh, as we discussed last week, the current limitation impacts the scalability of the Fleet effort. Every agent that connects to Fleet is stored as a saved object that can be managed in the UI. The limitation is not too bad for us yet: we are still actively working on performance for handling large numbers of agents, so the number of users who will reach this limit is small. But we will soon want to handle >10k agents smoothly, and the number of large-scale users will increase. #78520 describes how the current SO client limits us and our current UI workaround.

cc @ph for awareness & prioritization

@mshustov
Contributor

mshustov commented Nov 5, 2020

In the 7.12 release, the team is going to investigate the basic architecture.

@XavierM
Contributor

XavierM commented Nov 9, 2020

To alleviate this, we could add scrolling support to SavedObjects, however, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default.

This is obviously a very stupid option in terms of memory usage, but still asking: could using the scroll API internally in the SO repository be an option? When page + perPage > index.max_result_window, we could use the scroll API under the hood, fetch everything, and return the aggregated results.

scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default

This is configurable via the scroll option though, so if we expose a new scroll API on the SOR/SOC, we could just expose that too. But yeah, TTL handling would have to be done by the consumer anyway; there is not much we can do when _search/scroll is called with an expired scroll.

@pgayvallet, we did try to use the Elasticsearch scroll API at the beginning of SIEM, but we ran into a problem with this approach: at the time there was no EUI table that worked with the scroll API. Of course, we thought about using simple < > pagination, but we got user feedback that they did not like it; it did not give them the feeling that we had their data in hand. So we refactored our query to use simple from and size, and we remove the last page when there are more than 10,000 rows. By doing that we were able to get back on our feet.

I am sharing this because using the scroll API would break most of our tables: if you click on page 3, you only know the cursor of page 1, not the cursor of page 3, or you would have to make three queries to learn page 3's cursor. Anyway, I would love to hear your approach here, since I think every solution is dealing with the same kind of problem.

@pgayvallet
Contributor

pgayvallet commented Nov 20, 2020

I've been looking at async_search, scroll and search_after, and I came to the conclusion that none of these options would fully address the problem we are facing here.

For scenarios where we do bulk processing of a very large number of objects on the server side, all of these solutions would work, as they would all allow us to 'scroll' through all the results of a query that exceeds the index.max_result_window value.

However, as @XavierM mentioned in his comment, one of the most common scenario where we face this limitation is when displaying the saved objects in a paginated table in the UI.

To demonstrate, take the saved objects management table as an example, where we display all the visible saved objects:

In this table, we paginate the results in pages of, say, 100 items. The pagination buttons allow the user to navigate forward, back, to the first page or to the last page. Currently, when accessing the page PAGE that displays PER_PAGE results, we call _search with from: PER_PAGE * (PAGE - 1), size: PER_PAGE. These parameters are deterministic and can be computed independently of the page the user is currently on.
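In other words (searchParams is a hypothetical helper name), from/size for any page are pure functions of PAGE and PER_PAGE:

```typescript
// from/size for an arbitrary page, independent of how the user navigated there.
function searchParams(page: number, perPage: number): { from: number; size: number } {
  return { from: perPage * (page - 1), size: perPage };
}
```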

As the user is able to navigate from any page to any other page, backward or forward, none of the suggested solutions would work to address this index.max_result_window limitation:

  • async_search would only allow us to retrieve partial results while the search is running (which achieves nothing here), or to return the full list of results. The full list is of little help for displaying a specific page, and would force us to 'cache' it to perform pagination on our side, either on the client (not really an option) or on the server (which introduces quite a lot of complexity around cache invalidation / entry removal on large cached data sets).

  • search_after only lets us fetch the results following the last performed request. Meaning that when we display page X, we can only display page X+1 next, which doesn't support navigating from any arbitrary page to another.

  • scroll is even worse in this case: since the search context has a TTL, it just doesn't work for user-initiated requests. It's even stated in the official ES docs: "The scroll API is recommended for efficient deep scrolling but scroll contexts are costly and it is not recommended to use it for real time user requests. The search_after parameter circumvents this problem by providing a live cursor."
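The search_after limitation above can be sketched as follows; search is a hypothetical, synchronous stand-in for a _search call taking the sort values of the previous page's last hit:

```typescript
// A hit carries the sort values that become the next page's cursor.
interface Hit {
  sortValues: unknown[];
}

// With search_after, page N can only be reached by walking pages 1..N-1 first:
// displaying page N costs N sequential requests.
function fetchPage(
  search: (searchAfter: unknown[] | undefined) => Hit[],
  targetPage: number
): Hit[] {
  let cursor: unknown[] | undefined;
  let hits: Hit[] = [];
  for (let page = 1; page <= targetPage; page++) {
    hits = search(cursor);
    cursor = hits.length ? hits[hits.length - 1].sortValues : cursor;
  }
  return hits;
}
```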

Which is why I'm wondering: as our indices are now system indices, could we just ask the ES team to raise the default value of index.max_result_window on these system indices? It wouldn't solve the problem per se, but by setting this value to 100k by default instead of 10k, we could probably work around most, if not all, of our (current) data volumes without any change to the codebase. Of course, we should also confirm with ES that this would be alright performance-wise.

@joshdover
Contributor Author

As our indices are now system indices, could we just ask the ES team to change the default value of index.max_result_window on these system indices to a higher value?

I don't think we'd need to wait for system indices for this? We should be able to set this setting directly on the index during the migration process.

Of course, we should also confirm with ES that this would be alright performance-wise.

This is my primary question. I'm curious whether the Elasticsearch performance issues scale with the number of documents or with the size (in bytes) of the results. Since we're primarily paginating large numbers of really small documents, I'm hoping it's the latter.

Would anyone from @elastic/es-perf be able to shed light on this? Specifically, what is the reason for the index.max_result_window setting defaulting to 10k and what types of problems typically surface when increasing this limit? We're trying to paginate through 100k+ of small documents in the UI and increasing this limit on the .kibana index would be the 'easiest' solution from our perspective.

@rudolf
Contributor

rudolf commented Nov 24, 2020

I think changing index.max_result_window might be sufficient for saved objects management, because it's relatively rare for users to use that UI or export objects. But the performance penalty is probably too high for regular searches from plugins that need to page through more than 10k results, so I think this will eventually bite us.

I think we'll have to do something similar to what @XavierM mentioned, where the UI works around the problem. I'm not sure what page size we currently use, but more than 100 results probably don't fit on a screen, which means 10k results is at least 100 pages. I don't think users will ever need more pages than that; they should rather narrow down their search. The UI could then use from and size and display a message like "Your search results were too large to display, only showing the first 10 000 results". We could also use a pattern similar to Gmail's "Select all conversations that match this search" to allow users to export all the saved objects that match a search even when there are more than 10k results. That would require the export API to accept a query (and maybe a KQL filter too).

(Screenshot: Gmail's "Select all conversations that match this search" banner)
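The capped-pagination workaround described above could be sketched like this (a minimal sketch; the 10,000 constant mirrors the index.max_result_window default, and the helper name is hypothetical):

```typescript
const MAX_RESULT_WINDOW = 10_000;

interface CappedPagination {
  pageCount: number; // pages the UI should actually offer
  truncated: boolean; // whether to show the "results were too large" message
}

// Cap the pageable results at the index.max_result_window ceiling and flag
// when the total exceeds it.
function capPagination(totalHits: number, perPage: number): CappedPagination {
  const reachable = Math.min(totalHits, MAX_RESULT_WINDOW);
  return {
    pageCount: Math.ceil(reachable / perPage),
    truncated: totalHits > MAX_RESULT_WINDOW,
  };
}
```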

We would have to add a tiebreaker field to all saved objects to allow the find API to support > 10k search results using search_after.

@pgayvallet
Contributor

pgayvallet commented Nov 24, 2020

I overall agree that implementing search_after for the SO find API seems the most straightforward.

We would have to add a tiebreaker field to all saved objects

The good old {type}|{id}, already used in a lot of places, at least on the client side.

Note that one notable constraint/limitation of such an approach is that we would only be allowed to sort by this tie_breaker_id in the successive _find requests, since search_after requires the values of ALL the sort fields. So SavedObjectsFindOptions.sortField would be blanked in that case. This seems acceptable though.
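For illustration, a minimal sketch (hypothetical names) of a {type}|{id} tiebreaker and the search_after cursor derived from it, assuming the tiebreaker is the only sort field as discussed above:

```typescript
// Build the unique tiebreaker value for a saved object.
function tieBreakerOf(obj: { type: string; id: string }): string {
  return `${obj.type}|${obj.id}`;
}

// The cursor for the next page is the sort value of the current page's last hit.
function nextCursor(page: Array<{ type: string; id: string }>): string[] | undefined {
  const last = page[page.length - 1];
  return last ? [tieBreakerOf(last)] : undefined;
}
```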

Also, this would require migrating all existing SOs during a migration to populate the field, which is kind of unsupported at the moment (even if adding 'internal' migrations to core that affect all types should be doable).

@pgayvallet
Contributor

pgayvallet commented Dec 16, 2020

As mentioned during the breakout session, we should probably split that into two distinct tasks:

  • Add support for a tiebreaker field on all saved object types
    • add support for 'core' migrations
    • add a 'core' migration to populate the tiebreaker field on all existing objects
    • add the mechanism to add/update the tiebreaker field when performing CRUD operations on SOs (create, update and their bulk equivalents)
  • Support / implement search_after in the SO find API

The main goal of this issue is to address the problems related to SO import/export and consumers that need to 'scroll' over more than 10k objects. The 'paginated tables' issue when there are more than 10k objects should be addressed at a later time (and probably just by informing the user that only the first 10k objects can be displayed, as suggested in #77961 (comment)).

@rudolf
Contributor

rudolf commented Dec 17, 2020

I've opened #86300 to discuss how we might add a tiebreaker, but as a first iteration we could use search_after with a PIT, which doesn't require a tiebreaker field. For the use case of exporting all saved objects, using a PIT is preferable because it provides a consistent snapshot (without it, it's theoretically possible for an export to have broken references if an object was deleted in the middle of the export).
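A simplified, synchronous sketch of that PIT + search_after flow; openPit, search and closePit are hypothetical stand-ins for the real _pit and _search calls:

```typescript
// A hit carries its document plus the sort values used as the next cursor.
interface PitHit<T> {
  source: T;
  sortValues: unknown[];
}

// Open a point-in-time, page through it with search_after, and always close
// it, so the whole export sees one consistent snapshot of the index.
function exportAll<T>(
  openPit: () => string,
  search: (pitId: string, after: unknown[] | undefined) => PitHit<T>[],
  closePit: (pitId: string) => void
): T[] {
  const pitId = openPit();
  const out: T[] = [];
  try {
    let after: unknown[] | undefined;
    let hits = search(pitId, after);
    while (hits.length > 0) {
      out.push(...hits.map((h) => h.source));
      after = hits[hits.length - 1].sortValues;
      hits = search(pitId, after);
    }
  } finally {
    closePit(pitId);
  }
  return out;
}
```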

@lukeelmers
Member

lukeelmers commented Feb 24, 2021

Closing as we should now have the ability to page through >10k objects as of #89915

(We still have a separate issue open regarding adding a tiebreaker, but that did not end up blocking this effort after all)

[edit] Feel free to re-open if there's something I've missed here!
