Approximate Percentile #1

JLockerman · 2020-12-03T22:08:37Z

What's the functionality you would like to add
An approximate percentile function such as t-digest. This would have two main advantages over percentile_cont:

It can use less space when calculating over large datasets.
The digesting part can be separated from the calculation of the percentile, allowing post-hoc analysis, and more powerful VIEWs.
It can constructed in a way that can be use with continuous aggregates.

How would the function be used

Basic percentile calculation works just like exat percentile calculation, expect it takes in an accuracy measure (for t-digest the number of centroids)

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(10 rows)

-- regular approximate percentile call
SELECT approx_percentile(value, 0.50, accuracy => 10) FROM data;
 approx_percentile
-------------------
              51.5
(1 row)

-- increasing the number of centroids increases the accuracy at the cost of more storage
SELECT approx_percentile(value, 0.50, accuracy => 100) FROM data;
 approx_percentile
-------------------
              50.5
(1 row)

When storing the data, for instance in continuous aggregates, the digest itself can be stored, instead of the percentile, allowing future analysis to chose which data it wants to get out.

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(100 rows)

-- we output the /digest/ not the calculated percentile, with a fixed accuracy
-- in real usage this would likely be a continuous aggregate instead of a regular view
CREATE VIEW aggregated AS SELECT precentile_digest(value, accuracy => 100) FROM data;
INSERT 0 1

-- we can output various percentiles from the same digest
SELECT
    approx_percentile(digest, 0.10) tenth,
    approx_percentile(digest, 0.50) median,
    approx_percentile(digest, 0.50) ninetieth
FROM data;
 tenth | median | ninetieth
-------+--------+-----------
  10.9 |   50.5 |  90.10001
(1 row)

-- and also other certain other outputs, depending on the digest
SELECT avg(digest), min_val(digest), max_val(digest), num_elements(digest) FROM aggregated;
  avg | min |  max  | num_elements
------+-----+-------+--------------
 50.5 | 1.0 | 100.0 |          100
(1 row)

Open Questions
TBD

Remaining Work

Is our choice of k() acceptable?
Cleanup of the type and function names
Cleanup of the TEXT input and output for the t-digest type (this is a more general issue for all our types).
Are t-digest merges handled correctly if they have different creation parameters?
Are there any outstanding serialization issues?

The text was updated successfully, but these errors were encountered:

JLockerman · 2020-12-08T17:26:53Z

See also this TimescaleDB PR

JLockerman · 2021-02-02T16:06:00Z

The t-digest implementation is merged. As far as I know there are three remaining open issues before the feature is done:

Is our choice of k() acceptable?
Cleanup of the type and function names
Cleanup of the TEXT input and output for the t-digest type (this is a more general issue for all our types).

@WireBaron did I miss anything?

WireBaron · 2021-02-03T19:43:37Z

I should convert the comments to proper rustdoc format and make a quick pass over them. I think there's also a serialization issue with not fully populated digests that needs to be tracked down.

JLockerman · 2021-02-03T20:02:51Z

@WireBaron I'll have a quick run through of the code, see if I can find anything

@WireBaron

61: Misc t-digest cleaning r=JLockerman a=JLockerman I got distracted looking for the serialization issue mentioned in #1, and did some experimental cleaning for the t-digest files, @WireBaron would you let me know what you think? Rename the type exposed to SQL to `tdigest` instead of `TimescaleTDigest`. Remove the auxiliary hand-written SQL files in favor of including the SQL directly in the rust files; it should be easier to keep them in sync that way. Co-authored-by: Joshua Lockerman <josh@timescale.com>

JLockerman · 2021-03-04T21:20:27Z

Am experimental version of this is available in 0.1.0, and should also be available on Forge soon.

JLockerman · 2021-04-06T19:14:18Z

see also #41

JLockerman · 2021-07-15T17:52:24Z

This was released in 1.0.0

JLockerman added the proposed-feature A proposed feature or function to ease a task label Dec 3, 2020

JLockerman mentioned this issue Dec 8, 2020

Median Aggregate #7

Open

3 tasks

JLockerman pinned this issue Dec 21, 2020

JLockerman unpinned this issue Dec 21, 2020

JLockerman mentioned this issue Feb 3, 2021

Misc t-digest cleaning #61

Merged

davidkohn88 mentioned this issue Feb 19, 2021

ORDER BY is not supported by continuous view timescale/timescaledb#2872

Open

This was referenced Apr 6, 2021

Support UDDSketch (aggregation) #41

Closed

First Stabilization Checklist #107

Closed

Loquacity mentioned this issue Jun 7, 2021

New Analytics section timescale/docs#107

Merged

3 tasks

Loquacity self-assigned this Jun 8, 2021

JLockerman closed this as completed Jul 15, 2021

ericcoulthard mentioned this issue Oct 19, 2021

timevector broken for compressed chunks #279

Closed

epgts mentioned this issue Sep 13, 2022

Move update-tester tests to markdown files #522

Merged

epgts mentioned this issue Feb 9, 2023

Speed up tools/testbin #563

Open

snyrkill mentioned this issue Oct 6, 2023

Cargo error when building #777

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Approximate Percentile #1

Approximate Percentile #1

JLockerman commented Dec 3, 2020 •

edited

Loading

JLockerman commented Dec 8, 2020

JLockerman commented Feb 2, 2021

WireBaron commented Feb 3, 2021

JLockerman commented Feb 3, 2021

JLockerman commented Mar 4, 2021

JLockerman commented Apr 6, 2021

JLockerman commented Apr 6, 2021

JLockerman commented Jul 15, 2021

Approximate Percentile #1

Approximate Percentile #1

Comments

JLockerman commented Dec 3, 2020 • edited Loading

JLockerman commented Dec 8, 2020

JLockerman commented Feb 2, 2021

WireBaron commented Feb 3, 2021

JLockerman commented Feb 3, 2021

JLockerman commented Mar 4, 2021

JLockerman commented Apr 6, 2021

JLockerman commented Apr 6, 2021

JLockerman commented Jul 15, 2021

JLockerman commented Dec 3, 2020 •

edited

Loading