Adding min_n and max_n aggregates #590

WireBaron · 2022-10-21T19:11:57Z

This adds new aggregates min_n and max_n for getting the n smallest or largest values from a column. They work with integer, float, and timestamptz values. These functions return an aggregate object which can be combined with other such objects via rollup. The data can be extracted from the aggregate via the into_array or into_values methods, which return an array of the values or a table containing them, respectively.
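As a rough illustration of the idea (not the code in this PR), a bounded max-heap is one common way to keep only the n smallest values while scanning a column, and merging two such states is all a rollup needs. The `NSmallest` type and its method names below are hypothetical and use only the standard library.

```rust
use std::collections::BinaryHeap;

/// Hypothetical stand-in for the aggregate state: keeps the `capacity`
/// smallest values seen so far.
#[derive(Clone, Debug)]
pub struct NSmallest<T: Ord> {
    capacity: usize,
    // Max-heap: the root is the largest retained value, i.e. the next to evict.
    heap: BinaryHeap<T>,
}

impl<T: Ord + Copy> NSmallest<T> {
    pub fn new(capacity: usize) -> Self {
        NSmallest {
            capacity,
            heap: BinaryHeap::with_capacity(capacity),
        }
    }

    /// Offer one value; it is kept only if it belongs among the smallest.
    pub fn push(&mut self, value: T) {
        if self.heap.len() < self.capacity {
            self.heap.push(value);
        } else if let Some(mut largest) = self.heap.peek_mut() {
            if value < *largest {
                // The heap is re-sifted when `largest` goes out of scope.
                *largest = value;
            }
        }
    }

    /// Fold another state into this one, as a rollup would.
    pub fn rollup(&mut self, other: &Self) {
        for &value in other.heap.iter() {
            self.push(value);
        }
    }

    /// Retained values in ascending order.
    pub fn into_sorted_vec(self) -> Vec<T> {
        self.heap.into_sorted_vec()
    }
}
```

Because each state only ever holds `capacity` elements, combining two states costs the same regardless of how many rows each one originally saw.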

It further adds min_n_by and max_n_by functions that take a value of one of the above types plus an associated piece of data (taken as an AnyElement, so it can be any type). These behave the same as min_n/max_n above, but also return the associated data for the smallest or largest elements. into_array is not implemented for these aggregates, as it's not clear what that array would look like, and into_values requires a value of the appropriate type as an input to allow postgres to determine the function output type (the suggested approach is to just cast a NULL to the type of the associated data).
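One way to picture the "by" variants is a heap of `(value, slot)` pairs next to a vector of payloads, where an evicted element's slot is reused for the incoming payload. This is a sketch under that assumption only; `NSmallestBy` and `into_rows` are hypothetical names, and the generic payload `D` stands in for the AnyElement data.

```rust
use std::collections::BinaryHeap;

/// Hypothetical state for a `min_n_by`-style aggregate: the heap orders
/// `(value, slot)` pairs, and `data[slot]` holds the payload for that value.
pub struct NSmallestBy<T: Ord + Copy, D> {
    capacity: usize,
    heap: BinaryHeap<(T, usize)>,
    data: Vec<D>,
}

impl<T: Ord + Copy, D> NSmallestBy<T, D> {
    pub fn new(capacity: usize) -> Self {
        NSmallestBy {
            capacity,
            heap: BinaryHeap::new(),
            data: Vec::new(),
        }
    }

    pub fn push(&mut self, value: T, payload: D) {
        if self.heap.len() < self.capacity {
            // Still filling up: the payload gets a fresh slot.
            self.heap.push((value, self.data.len()));
            self.data.push(payload);
        } else if let Some(mut largest) = self.heap.peek_mut() {
            if value < largest.0 {
                let slot = largest.1;
                // Reuse the evicted element's slot; the old payload is dropped.
                self.data[slot] = payload;
                *largest = (value, slot);
            }
        }
    }

    /// `(value, payload)` rows in ascending order of value.
    pub fn into_rows(self) -> Vec<(T, D)> {
        let mut slots: Vec<Option<D>> = self.data.into_iter().map(Some).collect();
        self.heap
            .into_sorted_vec()
            .into_iter()
            .map(|(value, slot)| {
                let payload = slots[slot].take().expect("each slot is used once");
                (value, payload)
            })
            .collect()
    }
}
```

Returning rows rather than a flat array mirrors the description's point that a table is the natural output shape when every value carries associated data.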

Fixes #511

epgts · 2022-10-24T17:23:51Z

extension/src/minn_maxn.rs

+type FloatMinTransType = NMostTransState<NotNan<f64>>;
+type FloatMaxTransType = NMostTransState<Reverse<NotNan<f64>>>;
+type TimeMinTransType = NMostTransState<pg_sys::TimestampTz>;
+type TimeMaxTransType = NMostTransState<Reverse<pg_sys::TimestampTz>>;


What do you think about keeping this file for the common type and functions and creating one child module file per type? I think it would not only make this file and each individual implementation easier to read, but if kept in the same order, it makes it easy to diff the implementations to spot interesting differences.

Can't say I like it, but I think I dislike it less than any of the alternatives.
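The type aliases quoted in this thread wrap the "max" variants in `Reverse`, which suggests a single generic routine can serve both directions. The sketch below shows that pattern with plain `i64` values; `n_most`, `min_n`, and `max_n` here are illustrative free functions, not the extension's API. Floats need an extra wrapper such as `NotNan` because `f64` does not implement `Ord`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keep the `n` smallest items under `T`'s ordering, using a max-heap whose
/// root is always the next candidate for eviction.
fn n_most<T: Ord>(items: impl IntoIterator<Item = T>, n: usize) -> Vec<T> {
    let mut heap: BinaryHeap<T> = BinaryHeap::with_capacity(n);
    for item in items {
        if heap.len() < n {
            heap.push(item);
        } else if let Some(mut worst) = heap.peek_mut() {
            if item < *worst {
                *worst = item;
            }
        }
    }
    heap.into_sorted_vec()
}

/// The n smallest values, in ascending order.
pub fn min_n(values: &[i64], n: usize) -> Vec<i64> {
    n_most(values.iter().copied(), n)
}

/// The n largest values, in descending order: wrapping each value in
/// `Reverse` flips the ordering, so the same routine applies unchanged.
pub fn max_n(values: &[i64], n: usize) -> Vec<i64> {
    n_most(values.iter().copied().map(Reverse), n)
        .into_iter()
        .map(|Reverse(v)| v)
        .collect()
}
```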

epgts · 2022-10-24T17:25:24Z

extension/src/minn_maxn.rs

+use std::collections::BinaryHeap;
+
+#[derive(Clone, Debug, Serialize, Deserialize)]
+pub struct NMostTransState<T: Ord> {


The module is named "minn_maxn", functions are named like "min_max_trans_function", but this is named "NMost". Can we rename this to match? I think I like "NMost" better so maybe rename the module and functions?

I went with nmost for the generic functions; the specialized types and functions all follow the min_int or min_by_int pattern.

epgts · 2022-10-24T17:28:47Z

extension/src/minn_maxn.rs

+    }
+}
+
+fn min_max_rollup_trans_function<T: Ord + Copy>(


Are you going to add rollup tests?

Added to all unit tests.

epgts

I thought of some more questions but I'm not trying to block merging with them :)

epgts · 2022-10-28T15:53:30Z

extension/src/nmost/max_by_float.rs

+    }
+}
+
+#[pg_extern(schema = "toolkit_experimental", immutable, parallel_safe)]


Stylistic question: why do we sometimes do this rather than just move all these functions into the toolkit_experimental module? It's always confused me that we use the two different ways to put something into a schema.

You know, I've actually just sort of always followed the existing code and not really thought about it. I had thought that the schema name was needed for functions to make pgx put them in the right space, but it seems to do the right thing if we just put it in the module as well. Maybe this wasn't as reliable in a previous version of pgx?

epgts · 2022-10-28T15:57:17Z

extension/src/nmost.rs

+                .heap
+                .peek()
+                .expect("Can't be empty in this case");
+            let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {


Why use std::mem::replace when we don't want the previous value?

The replace was so that we could free the old datum, but I had just left that as a todo.
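For readers less familiar with the distinction being discussed: `std::mem::replace` hands back the previous value so the caller can do something with it, while a plain assignment simply drops the previous value in place. The helper names below are made up for illustration. For ordinary Rust types the drop takes care of cleanup; for a raw value with no `Drop` implementation, as the reviewer notes about the datum, any cleanup has to be done explicitly on the returned value.

```rust
/// Overwrite a slot and hand back the value that was there, so the caller
/// can clean it up explicitly.
pub fn swap_slot<T>(slots: &mut [T], index: usize, new_value: T) -> T {
    std::mem::replace(&mut slots[index], new_value)
}

/// Overwrite a slot when the previous value is not needed; the old value is
/// dropped in place by the assignment.
pub fn overwrite_slot<T>(slots: &mut [T], index: usize, new_value: T) {
    slots[index] = new_value;
}
```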

epgts · 2022-10-28T15:59:19Z

extension/src/nmost.rs

+            let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {
+                deep_copy_datum(new_element.datum(), new_element.oid())
+            });
+            // TODO: should we cleaning up 'old_datum' here?  There don't seem to be any convenience functions for this...


Is this a potential leak? I don't see where Datum implements Drop...

Well, everything here is done in the aggregate context, so the memory will be reclaimed when the aggregate finishes, which should be fine in almost every case. However, I couldn't quite convince myself that we weren't at risk of running out of memory if we ran this on a very large dataset, sorted such that every new element replaces an old value. So I added a free_datum function that I'm now using here.

WireBaron · 2022-10-29T17:49:56Z

bors r+

bors · 2022-10-29T18:01:57Z

Build succeeded:

WireBaron requested review from rtwalker, syvb, thatzopoulos and epgts October 21, 2022 19:11

syvb reviewed Oct 21, 2022

syvb reviewed Oct 21, 2022

epgts reviewed Oct 24, 2022

WireBaron force-pushed the br/min_max_n_by branch from d268ef9 to 44899d2 Compare October 28, 2022 04:08

epgts approved these changes Oct 28, 2022

thatzopoulos approved these changes Oct 28, 2022

syvb approved these changes Oct 28, 2022

WireBaron force-pushed the br/min_max_n_by branch from 48bb825 to 9fe5b8f Compare October 29, 2022 17:48

bors bot merged commit 6791f1b into main Oct 29, 2022

bors bot deleted the br/min_max_n_by branch October 29, 2022 18:02
