Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Approximate Percentile #1

Closed
5 tasks
Tracked by #107
JLockerman opened this issue Dec 3, 2020 · 8 comments
Closed
5 tasks
Tracked by #107

Approximate Percentile #1

JLockerman opened this issue Dec 3, 2020 · 8 comments
Assignees
Labels
proposed-feature A proposed feature or function to ease a task

Comments

@JLockerman
Copy link
Contributor

JLockerman commented Dec 3, 2020

What's the functionality you would like to add
An approximate percentile function such as t-digest. This would have two main advantages over percentile_cont:

  1. It can use less space when calculating over large datasets.
  2. The digesting part can be separated from the calculation of the percentile, allowing post-hoc analysis, and more powerful VIEWs.
  3. It can constructed in a way that can be use with continuous aggregates.

How would the function be used

Basic percentile calculation works just like exat percentile calculation, expect it takes in an accuracy measure (for t-digest the number of centroids)

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(10 rows)

-- regular approximate percentile call
SELECT approx_percentile(value, 0.50, accuracy => 10) FROM data;
 approx_percentile
-------------------
              51.5
(1 row)

-- increasing the number of centroids increases the accuracy at the cost of more storage
SELECT approx_percentile(value, 0.50, accuracy => 100) FROM data;
 approx_percentile
-------------------
              50.5
(1 row)

When storing the data, for instance in continuous aggregates, the digest itself can be stored, instead of the percentile, allowing future analysis to chose which data it wants to get out.

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(100 rows)

-- we output the /digest/ not the calculated percentile, with a fixed accuracy
-- in real usage this would likely be a continuous aggregate instead of a regular view
CREATE VIEW aggregated AS SELECT precentile_digest(value, accuracy => 100) FROM data;
INSERT 0 1

-- we can output various percentiles from the same digest
SELECT
    approx_percentile(digest, 0.10) tenth,
    approx_percentile(digest, 0.50) median,
    approx_percentile(digest, 0.50) ninetieth
FROM data;
 tenth | median | ninetieth
-------+--------+-----------
  10.9 |   50.5 |  90.10001
(1 row)

-- and also other certain other outputs, depending on the digest
SELECT avg(digest), min_val(digest), max_val(digest), num_elements(digest) FROM aggregated;
  avg | min |  max  | num_elements
------+-----+-------+--------------
 50.5 | 1.0 | 100.0 |          100
(1 row)

Open Questions
TBD

Remaining Work

  • Is our choice of k() acceptable?
  • Cleanup of the type and function names
  • Cleanup of the TEXT input and output for the t-digest type (this is a more general issue for all our types).
  • Are t-digest merges handled correctly if they have different creation parameters?
  • Are there any outstanding serialization issues?
@JLockerman JLockerman added the proposed-feature A proposed feature or function to ease a task label Dec 3, 2020
@JLockerman
Copy link
Contributor Author

See also this TimescaleDB PR

@JLockerman JLockerman mentioned this issue Dec 8, 2020
3 tasks
@JLockerman JLockerman pinned this issue Dec 21, 2020
@JLockerman JLockerman unpinned this issue Dec 21, 2020
@JLockerman
Copy link
Contributor Author

The t-digest implementation is merged. As far as I know there are three remaining open issues before the feature is done:

  1. Is our choice of k() acceptable?
  2. Cleanup of the type and function names
  3. Cleanup of the TEXT input and output for the t-digest type (this is a more general issue for all our types).

@WireBaron did I miss anything?

@WireBaron
Copy link
Contributor

I should convert the comments to proper rustdoc format and make a quick pass over them. I think there's also a serialization issue with not fully populated digests that needs to be tracked down.

@JLockerman
Copy link
Contributor Author

@WireBaron I'll have a quick run through of the code, see if I can find anything

bors bot added a commit that referenced this issue Feb 4, 2021
61: Misc t-digest cleaning r=JLockerman a=JLockerman

I got distracted looking for the serialization issue mentioned in #1, and did some experimental cleaning for the t-digest files, @WireBaron would you let me know what you think?

Rename the type exposed to SQL to `tdigest` instead of `TimescaleTDigest`.
Remove the auxiliary hand-written SQL files in favor of including the SQL directly in the rust files; it should be easier to keep them in sync that way.


Co-authored-by: Joshua Lockerman <josh@timescale.com>
@JLockerman
Copy link
Contributor Author

Am experimental version of this is available in 0.1.0, and should also be available on Forge soon.

@JLockerman
Copy link
Contributor Author

See also #35

@JLockerman
Copy link
Contributor Author

see also #41

@JLockerman
Copy link
Contributor Author

This was released in 1.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposed-feature A proposed feature or function to ease a task
Projects
None yet
Development

No branches or pull requests

3 participants