Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Add a cluster setting for memory threshold to pick OrdinalsCollector in Cardinality aggregation #15269

Open
rishabhmaurya opened this issue Aug 15, 2024 · 9 comments · May be fixed by #15764
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Search:Aggregations

Comments

@rishabhmaurya
Copy link
Contributor

rishabhmaurya commented Aug 15, 2024

Is your feature request related to a problem? Please describe

Currently the logic to pick Ordinals vs DirectCollector is dynamic and based on the number of ordinals to collect and memory overhead. If memory overhead is high for OrdinalsCollector, then DirectCollector is used. Due to https://issues.apache.org/jira/browse/LUCENE-9663 few users are reporting regression in Cardinality aggregation because of slower DirectCollector after replacing prefix compression with LZ4 compression for terms dictionary in lucene 8.9.
Fix proposed here is to use OrdinalsCollector more often which will collect the ordinals into a bitset first and then performs term lookup in postCollect() of segment and that's a lot faster.
However, we don't have a way to control picking up OrdinalsCollector in OpenSearch.

Describe the solution you'd like

Introduce a memory threshold dynamic setting which OrdinalsCollector can use and if its usage is under this threshold, always use OrdinalsCollector. This logic can be added here.

Related component

Search:Aggregations

Describe alternatives you've considered

Use of eager_global_ordinals or murmur hash, but its an index time setting and can have impact on indexing performance. Its discussed in more detail here.

Additional context

No response

@rishabhmaurya rishabhmaurya added enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers untriaged labels Aug 15, 2024
@mch2 mch2 removed the untriaged label Aug 21, 2024
@mch2
Copy link
Member

mch2 commented Aug 21, 2024

Thanks @rishabhmaurya seems like something we should do, am removing untriaged.

@maitreya2954
Copy link

maitreya2954 commented Aug 28, 2024

@mch2 I would like to work on this. I am new to contributing to OpenSearch. So, would appreciate any guidance.

@jainankitk
Copy link
Collaborator

Thanks @maitreya2954 for volunteering, assigned it to you!

@rishabhmaurya
Copy link
Contributor Author

@maitreya2954 Appreciate you picking this up. Here is what we can do -

  1. Introduce execution_hint for cardinality aggregation similar to term aggregation, with values: direct, ordinal to start with.
  2. The default behaviour should be current behaviour. When execution_hint is provided, override the collector to be picked here and don't honour this condition here.
  3. Add unit tests for all 3 cases.
  4. Add documentation here by opening a documentation issue on this repo.

If you're blocked don't hesitate to ask for pointers. Thank you.

@maitreya2954 maitreya2954 linked a pull request Sep 5, 2024 that will close this issue
3 tasks
@maitreya2954
Copy link

@rishabhmaurya I added execution hint field to the cardinality agg. However, I am stuck at adding unit tests.

Add unit tests for all 3 cases.

3 cases being: direct, ordinal and when no value is provided. right?

Here's my question: What type of query should I test after setting the execution hint. I can see many different type of queries being tested here. Should I just do MatchAllDocsQuery?

@rishabhmaurya
Copy link
Contributor Author

rishabhmaurya commented Sep 9, 2024

@maitreya2954 That's great that you were able to add execution hint.

3 cases being: direct, ordinal and when no value is provided. right?

yes.

What type of query should I test after setting the execution hint. I can see many different type of queries being tested here. Should I just do MatchAllDocsQuery?

You can test for both match_all query and a filter using term query on a field other than cardinality aggregation field.

@rishabhmaurya
Copy link
Contributor Author

@maitreya2954 how were you thinking about checking which collector is picked from tests?

One way could add a method to this class to get the current collector picked for a given segment and add a logic here to store the name/class -

public LeafBucketCollector getLeafCollector(LeafReaderContext ctx) throws IOException {

@maitreya2954
Copy link

maitreya2954 commented Sep 13, 2024

@rishabhmaurya Here's the PR: #15764

Documentation PR: opensearch-project/documentation-website#8265

@rishabhmaurya
Copy link
Contributor Author

@maitreya2954 great, thanks! I will take a look shortly. Meanwhile you can work on fixing gradle checks which are failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Search:Aggregations
Projects
Status: 🆕 New
Development

Successfully merging a pull request may close this issue.

4 participants