Add support for 64-bit unsigned integers #32434

Closed
jpountz opened this issue Jul 27, 2018 · 13 comments · Fixed by #60050
Labels
>feature · :Search Foundations/Mapping (Index mappings, including merging and defining field types) · Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)

Comments

@jpountz
Contributor

jpountz commented Jul 27, 2018

As a follow-up to discussions in #17006, we agreed in FixitFriday to add support for 64-bit unsigned integers. These numbers will support accurate term and range queries, but aggregations will work on the nearest double value.
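
A minimal Java sketch of that "nearest double" caveat (class name and values are illustrative): near 2^64 the gap between adjacent doubles is 2^12 = 4096, so distinct u64 values collapse to the same double and a double-based aggregation cannot tell them apart.

```java
import java.math.BigInteger;

public class NearestDoubleDemo {
    public static void main(String[] args) {
        BigInteger a = new BigInteger("18446744073709551615"); // 2^64 - 1
        BigInteger b = new BigInteger("18446744073709550615"); // 1000 smaller
        // Both round to the same nearest double, 1.8446744073709552E19.
        System.out.println(a.doubleValue() == b.doubleValue()); // true
    }
}
```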

@jpountz jpountz added >feature help wanted adoptme :Search Foundations/Mapping Index mappings, including merging and defining field types labels Jul 27, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@rw-access
Contributor

Just adding this note here for reference.

We ran into a use case for this in elastic/ecs#673. The Windows registry only has a concept of unsigned numbers, so we saw two paths to take for u64 values greater than 2^63. We could either cast to s64, which is hard to enforce and too easy to get wrong, or store only the string representation. Since we have little need for math, we opted for a keyword string field in the interim.
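
A small Java sketch of those two paths (illustrative values): reinterpreting the bits as s64 preserves the data but surfaces negative numbers, while the string form stays exact at the cost of numeric semantics.

```java
import java.math.BigInteger;

public class U64PathsDemo {
    public static void main(String[] args) {
        String raw = "18446744073709551615"; // e.g. a registry QWORD of 2^64 - 1

        // Path 1: cast to s64 -- the bit pattern survives but the value
        // prints as -1, which is exactly the "too easy to do wrong" part.
        long asSigned = Long.parseUnsignedLong(raw);
        System.out.println(asSigned);                        // -1
        System.out.println(Long.toUnsignedString(asSigned)); // 18446744073709551615

        // Path 2: keep the string (the keyword interim solution) -- exact
        // and unambiguous, but no range queries or math on it.
        BigInteger exact = new BigInteger(raw);
        System.out.println(exact);
    }
}
```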

@marshallmain
Contributor

Following up on @rw-access's comment, there are a fair number of data fields coming from the Elastic Endpoint that we would like to store as u64 if possible. Is this issue being actively worked on?

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@ktetzlaff

Same problem here during processing of network performance data from routers as part of a network management system. This data contains counters (e.g. received octets) which are 64-bit unsigned values. It would be great if we could store them properly.

@mayya-sharipova mayya-sharipova removed the help wanted adoptme label Jul 6, 2020
@mayya-sharipova mayya-sharipova self-assigned this Jul 6, 2020
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jul 22, 2020
This field type supports
- indexing of integer values from [0, 18446744073709551615]
- precise queries (term, range)
- sorting and aggregations are based on conversion of long values
  to double and can be imprecise for large values.

Closes elastic#32434
@mayya-sharipova
Contributor

We, within the search team, had a discussion about the following points:

I. How to represent unsigned long internally, in JSON, and in the Elasticsearch clients for:

  • Queries
  • Sort values
  • Aggs
  • Scripts

Several options here are:

  1. Long for queries, sort values and scripts; Double for aggs
    • easy to handle
    • values > Long.MAX_VALUE are presented as negative, which is not a nice experience for users, especially when unsigned long is used for IDs
  2. BigInteger for query field values, sort values, scripts; Double for aggs
    • difficult to handle internally
    • some elasticsearch clients may not support them (need to confirm with the lang-client team)
  3. String
    • easy to handle
    • easy to support
    • no support for numeric metrics aggs (do we need them on unsigned long field?)

II. Should we accept mixing longs and unsigned longs when sorting on a field?
- Do we need this mixed sorting on longs and unsigned longs?
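
To make the sorting question concrete, here is a Java sketch (values illustrative): if u64 values were kept as raw bit patterns in signed longs, the natural long order would misplace everything at or above 2^63, and precise sorting needs an unsigned comparison.

```java
import java.util.Arrays;

public class UnsignedSortDemo {
    public static void main(String[] args) {
        // u64 values kept as raw bit patterns: 5, 2^63, and 2^64 - 1.
        Long[] vals = { 5L, Long.MIN_VALUE, -1L };

        // Signed order puts the largest unsigned values first:
        Arrays.sort(vals);
        System.out.println(Arrays.toString(vals)); // [-9223372036854775808, -1, 5]

        // Unsigned order is what a user sorting on the field expects:
        Arrays.sort(vals, Long::compareUnsigned);
        for (long v : vals) System.out.println(Long.toUnsignedString(v)); // 5, 2^63, 2^64-1
    }
}
```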

@robcowart

Option 3 is a non-starter for any use case that I have worked on. The most common scenario that I see is metrics which are counters. Besides the network equipment mentioned by @ktetzlaff, a significant amount of server and application monitoring data is unsigned 64-bit counters. For example, this is the package that Metricbeat uses to gather disk I/O stats (https://github.com/shirou/gopsutil/blob/master/disk/disk.go#L14), and as you can see there are a number of uint64 values. Beats actually sends these to Elasticsearch as a long, which is technically wrong, but to date simply unavoidable. So this feature is needed to "fix" many Beats modules as well.

Given that counters will make up the majority of unsigned values, it is important (mandatory, really) that at least the derivative aggregation works well and accurately. Whether that is best achieved with option 1 or 2... I would say that 2 is probably preferred. However, if 1 is the only way to get the necessary support, and it allows for derivative aggregations, I could live with it.

@Dumbaz

Dumbaz commented Aug 5, 2020

I am in the same situation as @robcowart: option 3 would not achieve anything for my use cases. The main reason I want 64-bit integers is that metric values are sometimes greater than the maximum the current data types offer (mostly long-running tasks, counted in milliseconds).

@jimczi
Contributor

jimczi commented Aug 6, 2020

I think we should eliminate option 1 from the picture. It's confusing to rely on a convention that requires translating a signed long. Option 2 is more natural since we already have an abstraction for numerics in scripts and sort values: they return a Number, and the user can choose the resolution they need when consuming the value. Unsigned longs could, for instance, return a BigInteger for values greater than Long.MAX_VALUE.
@elastic/es-clients would you be able to comment on the proposal? The JSON specification accepts numbers without restrictions, so we were wondering if we need to do anything special to ease the usage of these 64-bit numbers in clients?

@mayya-sharipova
Contributor

mayya-sharipova commented Aug 6, 2020

Thanks for comments everyone, very valuable.

@jimczi I have chatted with the es-clients team; all the clients (PHP, Perl, Ruby, Python, .NET) support BigInteger except JavaScript. Also, some clients on 32-bit platforms (which should be quite outdated by now) don't support BigInteger, e.g. PHP and Ruby.

For JavaScript, the language does support BigInts, but the native JSON parser does not. This should not be a blocker for us, though; the es-clients team is willing to write a custom JSON parser to handle this case.

One thing left is to confirm with the Kibana team that they can handle 64-bit integers.

Update:
I have chatted with the Kibana team, and Kibana has problems with handling big integers (relevant issue: elastic/kibana#40183), and this is not going to be fixed any time soon.
But considering that Kibana currently also has issues with large longs, and citing @timroes: "I don't think adding a new type to the list that will cause exactly the same issues will be a problem", I guess we can still proceed with our option 2.
@jimczi what do you think?

Updated option 2:

  1. Sort: return long for values < 2^63, BigInteger for values >= 2^63
  2. Scripts: return long for values < 2^63, BigInteger for values >= 2^63
  3. Aggregations:
    • terms agg: return long or BigInteger for buckets (if this field is used for "ids", it is valuable to have precise bucket keys)
    • numeric metric aggs: return doubles
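
A sketch of what that long-below-2^63 / BigInteger-above contract could look like in Java, assuming for illustration only that the u64 is held internally as its raw bit pattern in a signed long (the actual implementation may encode it differently):

```java
import java.math.BigInteger;

public class UnsignedLongValues {
    private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

    // Values < 2^63 have a non-negative bit pattern and come back as a
    // plain Long; values >= 2^63 would print negative as a long, so they
    // are widened to BigInteger and the user never sees a sign flip.
    static Number decode(long bits) {
        return bits >= 0 ? Long.valueOf(bits) : BigInteger.valueOf(bits).add(TWO_POW_64);
    }

    public static void main(String[] args) {
        System.out.println(decode(42L));            // 42
        System.out.println(decode(Long.MIN_VALUE)); // 9223372036854775808 (2^63)
        System.out.println(decode(-1L));            // 18446744073709551615 (2^64 - 1)
    }
}
```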

@Mpdreamz
Member

Just to confirm: are we planning a maximum of a 64-bit unsigned long, or an arbitrary-size BigInteger? The implied changes are the same, but in terms of (de)serialization performance it's good to know what to optimize for.

There are some potential client breaking changes.

JavaScript client: will need a dedicated generic serializer, which is going to be slower for everything. The workaround here is that the default WON'T support ulong/bigints OOTB, but the client can be injected with a serializer that supports it, either globally or per request.

Java HLRC: the Java HLRC relies heavily on boxing, e.g. the to and from on RangeQueryBuilder are Objects, so it is future proof 👍 (at the cost of boxing). Not sure if this is the case everywhere the change is intended to be supported.
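
For illustration, that boxing point means a BigInteger already fits through the existing HLRC signatures (a sketch, assuming the high-level REST client is on the classpath; whether the server accepts such values is the separate question this issue tracks):

```java
import java.math.BigInteger;
import org.elasticsearch.index.query.RangeQueryBuilder;

public class HlrcRangeSketch {
    public static void main(String[] args) {
        // from/to take Object, so no signature change is needed:
        RangeQueryBuilder query = new RangeQueryBuilder("counter")
                .from(new BigInteger("9223372036854775808"))  // 2^63
                .to(new BigInteger("18446744073709551615"));  // 2^64 - 1
        System.out.println(query);
    }
}
```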

.NET NEST client: in most cases we have dedicated types and interfaces, e.g. the range query is backed by an interface IRangeQuery with implementations like LongRangeQuery and DateRangeQuery that strongly type the to/from parameters. We can extend this with e.g. a BigIntegerRangeQuery quite easily. In some cases (especially responses) where values are now typed as long or double, changing them to BigInteger or BigDecimal is technically a breaking change.

I suggest we proceed with option 2 and:

  • Try to solve any breaking changes as best as we can.
  • Accept a technical breaking change that in practice is hard to hit (e.g. long implicitly converting to BigInteger and back might be good enough?).
  • If all else fails, document the limitations of a particular client and set it up for success when 8.0 comes around.

@ezimuel
Contributor

ezimuel commented Aug 12, 2020

PHP client: it supports 64-bit integers but not unsigned. We can support BigInteger using the GMP extension of PHP (not enabled by default in the language).
Perl: it supports 64-bit integers by default and BigInteger using bigint.

Because some clients may need to support 64-bit and BigInteger values in different ways, I suggest providing some facilities, for instance:

  • add a bigint_as_string=1 parameter in the query string (0 by default) to return 64-bit and BigInteger values as strings; this will prevent the JSON decoder from throwing errors if a numeric value is > 32-bit.
  • add a numeric_type field in the response that alerts the client to use 64 (bit) or big (BigInteger). This alert will not carry a 32 (bit) value, since that is the default. It will be triggered if at least one numeric value in the response is 64 or big.
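
A sketch of what the first suggestion would mean for a Java client (the bigint_as_string parameter is only proposed here, and the response shape is hypothetical; Jackson is used for parsing):

```java
import java.math.BigInteger;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BigintAsStringSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Default (bigint_as_string=0): the value is a JSON number. Jackson
        // widens it to BigInteger, but a binary64-based parser would
        // silently round it.
        JsonNode n = mapper.readTree("{\"count\":18446744073709551615}");
        System.out.println(n.get("count").bigIntegerValue());

        // Proposed bigint_as_string=1: the value arrives as a string that
        // any JSON parser carries losslessly; the client converts explicitly.
        JsonNode s = mapper.readTree("{\"count\":\"18446744073709551615\"}");
        System.out.println(new BigInteger(s.get("count").asText()));
    }
}
```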

@delvedor
Member

I've been thinking a lot about this, and I fear that sending numbers above the interop range for JSON could cause problems for some implementations.

From the json spec:

   This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
   generally available and widely used, good interoperability can be
   achieved by implementations that expect no more precision or range
   than these provide, in the sense that implementations will
   approximate JSON numbers within the expected precision.  A JSON
   number such as 1E400 or 3.141592653589793238462643383279 may indicate
   potential interoperability problems, since it suggests that the
   software that created it expects receiving software to have greater
   capabilities for numeric magnitude and precision than is widely
   available.

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.

For instance, if JavaScript gets a value outside that range, it can no longer guarantee arithmetic precision (MDN docs).
Furthermore, if JS tries to parse JSON containing a bigint value, it will round it without warning the user, causing a lot of debugging pain.
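
That boundary is easy to reproduce in Java, since double is the same IEEE 754 binary64 type the spec describes (minimal sketch):

```java
public class InteropRangeDemo {
    public static void main(String[] args) {
        long maxSafe = (1L << 53) - 1; // 9007199254740991, top of the interop range

        // Inside the range, integers round-trip through a double exactly:
        System.out.println((long) (double) maxSafe == maxSafe); // true

        // Just outside, adjacent integers collapse to the same double --
        // the silent rounding a binary64 JSON parser performs:
        System.out.println((double) (maxSafe + 1) == (double) (maxSafe + 2)); // true
    }
}
```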

Given that a custom serializer would introduce significant overhead, and it should be the user's choice whether to accept that, I agree with @ezimuel that introducing a query parameter such as bigint_as_string, which instructs Elasticsearch on how those values are sent and read, could be a good tradeoff.
By doing so, we can guarantee interoperability and still support all the use cases that need bigint as numeric values.

The JS client could use this query parameter as follows: send it by default on all requests, so we are sure not to lose precision, and let the user decide how to handle the bigint values; but also give users the ability to disable this default and use a custom serializer to handle BigInt automatically. (The latter cannot be the default for the JS client for two reasons: the JSON bigint parser is significantly slower, and the number and bigint types are not compatible, so the user must know what they are doing.)

Unfortunately, this is not only a JS problem; the issue occurs in other languages and tools as well. For example, if you use jq, any value above 1.79E+308 is rounded to 1.79E+308:

```console
$ echo '{"n":1.7976931348623157e+310}' | jq .
{
  "n": 1.7976931348623157e+308
}
```

By supporting something like bigint_as_string, we delegate this choice to the user, who can make it based on the platform they need to support.

@mayya-sharipova
Contributor

mayya-sharipova commented Aug 13, 2020

@ezimuel @delvedor Thanks a lot for the comments and explanations. We've discussed this and decided that we should find a systematic way for the REST layer to handle big integers (and big longs), and that this should not be part of this issue/PR. I have created a separate issue for this and put your ideas there.

For 64-bit unsigned integers, we have decided to proceed with our current plan to return Long/BigInteger.

mayya-sharipova added a commit that referenced this issue Sep 23, 2020
This field type supports
- indexing of integer values from [0, 18446744073709551615]
- precise queries (term, range)
- precise sort and terms aggregations
- other aggregations are based on conversion of long values
  to double and can be imprecise for large values.

Closes #32434
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Sep 24, 2020
This field type supports
- indexing of integer values from [0, 18446744073709551615]
- precise queries (term, range)
- precise sort and terms aggregations
- other aggregations are based on conversion of long values
  to double and can be imprecise for large values.

Backport for elastic#60050
Closes elastic#32434
mayya-sharipova added a commit that referenced this issue Sep 24, 2020
Introduce 64-bit unsigned long field type

This field type supports
- indexing of integer values from [0, 18446744073709551615]
- precise queries (term, range)
- precise sort and terms aggregations
- other aggregations are based on conversion of long values
  to double and can be imprecise for large values.

Backport for #60050
Closes #32434
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024