Option for case insensitive search at runtime #61162

Closed
5 of 7 tasks
markharwood opened this issue Aug 14, 2020 · 6 comments
Labels
>enhancement Meta :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@markharwood
Contributor

markharwood commented Aug 14, 2020

This meta issue tracks the various changes relating to offering a "case insensitive" option to various term-level queries (term, terms, prefix, wildcard, regex) at search time. It replaces the previous #53603 which meandered with various discussions.
In the query DSL we will offer a new case_insensitive flag, which can only be set to true to enable the new behaviour. When left unset, the existing behaviour is used (which is inconsistent: keyword fields with normalizers normalize query terms while text fields do not). Because of these inconsistencies and the lack of guarantees, setting the case_insensitive flag to false will throw an error.
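As a rough illustration of the flag rule described above, here is a minimal Java sketch. The class name CaseInsensitiveFlag, the parse method and the error message are hypothetical and are not taken from the Elasticsearch code base; the sketch only shows the three cases: set to true, left unset, and explicitly set to false.

```java
import java.util.Map;

/** Hypothetical sketch of the flag rule; not actual Elasticsearch code. */
final class CaseInsensitiveFlag {

    /**
     * Reads the optional "case_insensitive" entry from an already parsed query body.
     * Returns true when the flag is true, false when it is absent (existing behaviour),
     * and rejects any other value, including an explicit false.
     */
    static boolean parse(Map<String, Object> queryBody) {
        Object raw = queryBody.get("case_insensitive");
        if (raw == null) {
            return false; // unset: keep the existing, field-dependent behaviour
        }
        if (Boolean.TRUE.equals(raw)) {
            return true; // opt in to the new case insensitive matching
        }
        // Assumed error wording; the real message may differ.
        throw new IllegalArgumentException(
            "[case_insensitive] can only be set to [true], got [" + raw + "]");
    }

    public static void main(String[] args) {
        System.out.println(parse(Map.of("value", "Foo", "case_insensitive", true)));
        System.out.println(parse(Map.of("value", "Foo")));
    }
}
```

Treating an absent flag and an explicit false differently keeps "false" from implying a case sensitivity guarantee that the existing behaviour cannot give.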

Tasks

@markharwood markharwood added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 labels Aug 14, 2020
@markharwood markharwood self-assigned this Aug 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Aug 14, 2020
@jimczi jimczi added the Meta label Aug 14, 2020
@jimczi jimczi changed the title Meta issue - option for case insensitive search at runtime Option for case insensitive search at runtime Aug 14, 2020
@jimczi jimczi removed the v8.0.0 label Aug 14, 2020
@markharwood
Contributor Author

markharwood commented Aug 26, 2020

@jimczi / @jpountz - shall I open Lucene issues to add a case insensitive param to TermQuery, PrefixQuery and WildcardQuery, or create case insensitive variants for Elasticsearch?

@jpountz
Contributor

jpountz commented Aug 26, 2020

I wonder if we really need new queries or if we could reuse AutomatonQuery to build case-insensitive variants?

@markharwood
Contributor Author

markharwood commented Aug 26, 2020

> I wonder if we really need new queries or if we could reuse AutomatonQuery to build case-insensitive variants?

PrefixQuery and WildcardQuery already do reuse AutomatonQuery? They're lightweight subclasses that take args and implement a toAutomaton() function. The case insensitive options we want to add can either be added as an "if" statement in core Lucene or we fork those classes as something like this:

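import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;

// CaseInsensitiveAutomatonQuery is the proposed base class; it supplies toCaseInsensitiveChar.
public class CaseInsensitivePrefixQuery extends CaseInsensitiveAutomatonQuery {

    /** Constructs a case insensitive query for terms starting with <code>prefix</code>. */
    public CaseInsensitivePrefixQuery(Term prefix) {
        super(prefix, toAutomaton(prefix.bytes()), Integer.MAX_VALUE, true);
    }

    /** Build an automaton accepting all terms with the specified prefix, case insensitive. */
    public static Automaton toAutomaton(BytesRef prefix) {
        if (prefix == null) {
            throw new NullPointerException("prefix must not be null");
        }
        // One small automaton per code point, each accepting the case variants of that character.
        List<Automaton> list = new ArrayList<>();
        String s = prefix.utf8ToString();
        Iterator<Integer> iter = s.codePoints().iterator();
        while (iter.hasNext()) {
            list.add(toCaseInsensitiveChar(iter.next(), Integer.MAX_VALUE));
        }
        // Any suffix may follow the prefix.
        list.add(Automata.makeAnyString());

        Automaton a = Operations.concatenate(list);
        a = MinimizationOperations.minimize(a, Integer.MAX_VALUE);
        return a;
    }
}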

The CaseInsensitiveAutomatonQuery base class proposed above offers a toCaseInsensitiveChar helper function that builds [Ff][Oo][Oo]-style sequences from input such as foo.
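To make the [Ff][Oo][Oo] expansion concrete without depending on Lucene, here is a minimal sketch that uses java.util.regex as a stand-in for automata. This is an assumption made for illustration: the helper proposed above takes a second int argument and returns a Lucene Automaton, whereas this hypothetical version returns a regex fragment, and it only covers simple one-to-one case mappings from java.lang.Character.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical stand-in that uses java.util.regex instead of Lucene automata. */
final class CaseInsensitivePrefixSketch {

    /** Expands one code point into a class such as [Ff]; caseless code points are quoted as-is. */
    static String toCaseInsensitiveChar(int codePoint) {
        Set<Integer> variants = new LinkedHashSet<>();
        variants.add(Character.toUpperCase(codePoint));
        variants.add(Character.toLowerCase(codePoint));
        variants.add(codePoint); // keeps e.g. titlecase forms that are neither upper nor lower
        if (variants.size() == 1) {
            return Pattern.quote(new String(Character.toChars(codePoint)));
        }
        StringBuilder sb = new StringBuilder("[");
        for (int variant : variants) {
            sb.appendCodePoint(variant);
        }
        return sb.append(']').toString();
    }

    /** Concatenates the per-character classes, mirroring the list of automata in the issue's code. */
    static String expand(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp -> sb.append(toCaseInsensitiveChar(cp)));
        return sb.toString();
    }

    /** Prefix match: the expanded prefix followed by any string. */
    static Pattern prefixPattern(String prefix) {
        return Pattern.compile(expand(prefix) + ".*", Pattern.DOTALL);
    }

    public static void main(String[] args) {
        System.out.println(expand("foo")); // [Ff][Oo][Oo]
        System.out.println(prefixPattern("foo").matcher("FoObar").matches());
    }
}
```

The trailing ".*" plays the role of Automata.makeAnyString() in the Lucene version: it accepts whatever follows the prefix.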

@mbudge

mbudge commented Oct 12, 2020

The problem we have with the Beats templates is that they explicitly set each field, which takes priority over any settings applied through dynamic templates. In event management and incident response, we need case insensitive search to mitigate the risk of important events being missed because keywords are case sensitive and data is collected from many different systems on the network. We would need to write a lot of code to do the string normalisation for every field in every parser/Logstash.

Instead we have a Python script which adds the lowercase normaliser to every field in the Beats template. But this means we have to run the template through the script every time a new version is released. With Elastic moving Beats/template management to Fleet, and enrichment moving from JavaScript to ingest pipelines, we would still have to run each template through the Python script to add the lowercase normaliser.

We would be happy with an index level setting which adds the lowercase normaliser to every field when the index is created. That way teams who want to lowercase all keywords can apply this setting once in the index settings, and use KQL to do case-insensitive search without needing to add multi-fields.
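For readers unfamiliar with the workaround described in this comment, the template rewrite could look roughly like the Java sketch below. It is not the commenter's script (which is in Python and is not shown in the thread); the class and method names are hypothetical, and it assumes the mapping has already been parsed into nested maps with "properties" and "fields" sub-trees and that a normalizer with the given name is available to the index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the template rewrite; not the commenter's actual script. */
final class LowercaseNormalizerSketch {

    /** Returns a copy of a mapping's field tree with a normalizer added to every keyword field. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> addNormalizer(Map<String, Object> fields, String normalizerName) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            Map<String, Object> field = new LinkedHashMap<>((Map<String, Object>) entry.getValue());
            // Only keyword fields accept a normalizer; leave an existing one untouched.
            if ("keyword".equals(field.get("type")) && !field.containsKey("normalizer")) {
                field.put("normalizer", normalizerName);
            }
            // Recurse into object sub-fields ("properties") and multi-fields ("fields").
            for (String childKey : new String[] {"properties", "fields"}) {
                Object child = field.get(childKey);
                if (child instanceof Map) {
                    field.put(childKey, addNormalizer((Map<String, Object>) child, normalizerName));
                }
            }
            out.put(entry.getKey(), field);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> properties = Map.of(
            "host", Map.of("properties", Map.of("name", Map.of("type", "keyword"))),
            "message", Map.of("type", "text"));
        System.out.println(addNormalizer(properties, "lowercase"));
    }
}
```

A rewrite like this has to be re-run for every new template version, which is the maintenance cost the comment describes.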

@markharwood
Contributor Author

Closing as complete because the two remaining tasks ran into issues:

Query string case insensitive regex - there was no clean way to add the /i syntax to Lucene in a backwards compatible way.

Terms query - concerns over query complexity explosion and performance meant
