Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Fuzziness reject illegal values earlier #33511

Merged
merged 13 commits into from
Apr 5, 2019

Conversation

cbuescher
Copy link
Member

The current java implementation of Fuzziness leaves a lot of room for
initializing it with illegal values that either cause errors later when the
queries reach the shards where they are executed or values that are silently
ignored in favour of defaults. We should instead tighten the java implementation
of the class so that we only accept supported values. Currently those are
numeric values representing the edit distances 0, 1 and 2, optionally also as
float or string, and the "AUTO" fuzziness, which can come in a cusomizable
variant that allows specifying two value that define the positions in a term
where the AUTO option increases the allowed edit distance.

This change removes several redundant ways of object construction and adds input
validation to the remaining ones. Java users should either use one of the
predefined constants or use the static factory methods fromEdits(int) or
fromString(String) to create instances of the class, while other ctors are
hidden. This allows for instance control, e.g. returning one of the constants
when creating instances from an integer value.

Previously the class would accept any positive integer value and any float
value, while in effect the maximum allowed edit distance was capped at 2 in
practice. These values while throw an error now, as will any other String value
other than "AUTO" that where previously accepted but led to numeric exceptions
when the query was executed.

@cbuescher cbuescher added >enhancement >breaking :Search/Search Search-related issues that do not fall into other categories v7.0.0 labels Sep 7, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

The current java implementation of Fuzziness leaves a lot of room for
initializing it with illegal values that either cause errors later when the
queries reach the shards where they are executed or values that are silently
ignored in favour of defaults. We should instead tighten the java implementation
of the class so that we only accept supported values. Currently those are
numeric values representing the edit distances 0, 1 and 2, optionally also as
float or string, and the "AUTO" fuzziness, which can come in a cusomizable
variant that allows specifying two value that define the positions in a term
where the AUTO option increases the allowed edit distance.

This change removes several redundant ways of object construction and adds input
validation to the remaining ones. Java users should either use one of the
predefined constants or use the static factory methods `fromEdits(int)` or
`fromString(String)` to create instances of the class, while other ctors are
hidden. This allows for instance control, e.g. returning one of the constants
when creating instances from an integer value.

Previously the class would accept any positive integer value and any float
value, while in effect the maximum allowed edit distance was capped at 2 in
practice. These values while throw an error now, as will any other String value
other than "AUTO" that where previously accepted but led to numeric exceptions
when the query was executed.
@cbuescher
Copy link
Member Author

@elasticmachine test this please

* Note: Using this method only makes sense if the field you are applying Fuzziness to is some sort of string.
* @throws IllegalArgumentException if the edit distance is not in [0, 1, 2]
*/
public static Fuzziness fromEdits(int edits) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method used to be very permissive, e.g. parsing float values and silently casting them to ints, where values larger than 2 get ignored somewhere later in the code I think.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cbuescher , it looks good. I left one question

} else if (upperCase.startsWith("AUTO:")) {
return parseCustomAuto(upperCase);
} else {
// should be a float or int representing a valid edit distance, otherwise throw error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we restrict to int ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I see this still leaves the door open for something like "fuzziness" : 1.5 to be silently cast to int 1. At the same time I think it would be good to continue to allow floats equivalent to the allowed edit distances 0,1,2, e.g. allow "1.0" or "1.0000" but reject "1.1". This is because I remember other issues where users going through rest sometimes had issues that their json-strigification produces float-like values even if they feed them ints (I think this was mostly in javascript clients). Anyway, would you be okay to allow "0.0", "1.0" and "2.0" so anything evenly divisible still?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cbuescher
Copy link
Member Author

@jimczi I added a commit that disallows float values with 0 decimals (like "1.00", "2.0" etc...) for the three allowed values but rejects other floats. Let me know if that sounds okay to you.

@cbuescher
Copy link
Member Author

@elasticmachine run the gradle build tests 1

@jasontedor jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
@cbuescher
Copy link
Member Author

@jimczi just resurrected this PR from the bottom of my stack, I still would like to get it in, expecially after other recent fuzziness bugfixes (#39643). Would you mind taking another look?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed the updates on this pr. There are some conflicts that need to be resolved but the logic seems good to me.

} else if (upperCase.startsWith("AUTO:")) {
return parseCustomAuto(upperCase);
} else {
// should be a float or int representing a valid edit distance, otherwise throw error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cbuescher
Copy link
Member Author

@jimczi thanks for the review, I just resolved the merge conflict, will go over the PR again since its a bit old to check whether the docs need some updating anywhere and add a migration note since this changes the Java API slightly. Currently also only targeting 8 with this cleanup.

@jimczi
Copy link
Contributor

jimczi commented Apr 4, 2019

Sure, thanks for reviving this and +1 to target 8 only.

@cbuescher cbuescher merged commit 1597f69 into elastic:master Apr 5, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 5, 2019
* elastic/master: (36 commits)
  Remove unneded cluster config from test (elastic#40856)
  Make Fuzziness reject illegal values earlier (elastic#33511)
  Remove test-only customisation from TransReplAct (elastic#40863)
  Fix dense/sparse vector limit documentation (elastic#40852)
  Make -try xlint warning disabled by default. (elastic#40833)
  Async Snapshot Repository Deletes (elastic#40144)
  Revert "Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564)"
  Init global checkpoint after copy commit in peer recovery (elastic#40823)
  Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564)
  [DOCS] Removed redundant (not quite right) information about upgrades.
  Remove string usages of old transport settings (elastic#40818)
  Rollup ignores time_zone on date histogram (elastic#40844)
  HLRC: fix uri encode bug when url path starts with '/' (elastic#34436)
  Mutes GatewayIndexStateIT.testRecoverBrokenIndexMetadata
  Docs: Pin two IDs in the rest client (elastic#40785)
  Adds version 6.7.2
  [DOCS] Remind users to include @ symbol when applying license file (elastic#40688)
  HLRC: Convert xpack methods to client side objects (elastic#40705)
  Allow ILM to stop if indices have nonexistent policies (elastic#40820)
  Add an OpenID Connect authentication realm (elastic#40674)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 6, 2019
* master: (63 commits)
  Suppress lease background sync failures if stopping (elastic#40902)
  [DOCS] Added settings page for ILM. (elastic#40880)
  [Docs] Remove extraneous text (elastic#40914)
  Move test classes to test root in Painless (elastic#40873)
  Fix date index name processor default date_formats (elastic#40915)
  Source additional files correctly in elasticsearch-cli (elastic#40890)
  Allow AVX-512 on JDK 11+ (elastic#40828)
  [Docs] Change example to show col headers (elastic#40822)
  Update apache httpclient to version 4.5.8 (elastic#40875)
  Update monitoring-kibana.json (elastic#40899)
  Introduce Delegating ActionListener Wrappers (elastic#40129)
  Deprecate old transport settings (elastic#40821)
  Add Kibana application privileges for monitoring and ml reserved roles (elastic#40651)
  Use Writeable for TransportReplAction derivatives (elastic#40894)
  Add test for HTTP and Transport TLS on basic license (elastic#40714)
  Remove unneded cluster config from test (elastic#40856)
  Make Fuzziness reject illegal values earlier (elastic#33511)
  Remove test-only customisation from TransReplAct (elastic#40863)
  Fix dense/sparse vector limit documentation (elastic#40852)
  Make -try xlint warning disabled by default. (elastic#40833)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 6, 2019
* master: (77 commits)
  Suppress lease background sync failures if stopping (elastic#40902)
  [DOCS] Added settings page for ILM. (elastic#40880)
  [Docs] Remove extraneous text (elastic#40914)
  Move test classes to test root in Painless (elastic#40873)
  Fix date index name processor default date_formats (elastic#40915)
  Source additional files correctly in elasticsearch-cli (elastic#40890)
  Allow AVX-512 on JDK 11+ (elastic#40828)
  [Docs] Change example to show col headers (elastic#40822)
  Update apache httpclient to version 4.5.8 (elastic#40875)
  Update monitoring-kibana.json (elastic#40899)
  Introduce Delegating ActionListener Wrappers (elastic#40129)
  Deprecate old transport settings (elastic#40821)
  Add Kibana application privileges for monitoring and ml reserved roles (elastic#40651)
  Use Writeable for TransportReplAction derivatives (elastic#40894)
  Add test for HTTP and Transport TLS on basic license (elastic#40714)
  Remove unneded cluster config from test (elastic#40856)
  Make Fuzziness reject illegal values earlier (elastic#33511)
  Remove test-only customisation from TransReplAct (elastic#40863)
  Fix dense/sparse vector limit documentation (elastic#40852)
  Make -try xlint warning disabled by default. (elastic#40833)
  ...
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
The current java implementation of Fuzziness leaves a lot of room for
initializing it with illegal values that either cause errors later when the
queries reach the shards where they are executed or values that are silently
ignored in favour of defaults. We should instead tighten the java implementation
of the class so that we only accept supported values. Currently those are
numeric values representing the edit distances 0, 1 and 2, optionally also as
float or string, and the "AUTO" fuzziness, which can come in a cusomizable
variant that allows specifying two value that define the positions in a term
where the AUTO option increases the allowed edit distance.

This change removes several redundant ways of object construction and adds input
validation to the remaining ones. Java users should either use one of the
predefined constants or use the static factory methods `fromEdits(int)` or
`fromString(String)` to create instances of the class, while other ctors are
hidden. This allows for instance control, e.g. returning one of the constants
when creating instances from an integer value.

Previously the class would accept any positive integer value and any float
value, while in effect the maximum allowed edit distance was capped at 2 in
practice. These values while throw an error now, as will any other String value
other than "AUTO" that where previously accepted but led to numeric exceptions
when the query was executed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants