Adding min_n and max_n aggregates #590
Conversation
extension/src/minn_maxn.rs (outdated)

```rust
type FloatMinTransType = NMostTransState<NotNan<f64>>;
type FloatMaxTransType = NMostTransState<Reverse<NotNan<f64>>>;
type TimeMinTransType = NMostTransState<pg_sys::TimestampTz>;
type TimeMaxTransType = NMostTransState<Reverse<pg_sys::TimestampTz>>;
```
What do you think about keeping this file for the common type and functions and creating one child module file per type? I think it would not only make this file and each individual implementation easier to read, but if kept in the same order, it makes it easy to diff the implementations to spot interesting differences.
Can't say I like it, but I think I dislike it less than any of the alternatives.
extension/src/minn_maxn.rs (outdated)

```rust
use std::collections::BinaryHeap;

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct NMostTransState<T: Ord> {
```
The module is named "minn_maxn", functions are named like "min_max_trans_function", but this is named "NMost". Can we rename this to match? I think I like "NMost" better so maybe rename the module and functions?
I went with `nmost` for the generic functions; the specialized types and functions all follow `min_int` or `min_by_int`.
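For intuition, the transition state discussed above can be pictured as a bounded max-heap that keeps the n smallest values seen so far. Below is a simplified, stand-alone sketch; the name `NSmallest` is hypothetical, and the real `NMostTransState` stores Datums and handles the `max_n` variants by wrapping values in `Reverse`:

```rust
use std::collections::BinaryHeap;

// Hypothetical simplified analogue of NMostTransState: keep the n
// smallest values using a max-heap, whose root is the largest (worst)
// value currently retained.
struct NSmallest<T: Ord> {
    capacity: usize,
    heap: BinaryHeap<T>,
}

impl<T: Ord> NSmallest<T> {
    fn new(capacity: usize) -> Self {
        NSmallest { capacity, heap: BinaryHeap::with_capacity(capacity) }
    }

    fn push(&mut self, value: T) {
        if self.heap.len() < self.capacity {
            self.heap.push(value);
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // Replace the largest retained value only if the newcomer
            // is smaller; PeekMut re-sifts the heap on drop.
            if value < *worst {
                *worst = value;
            }
        }
    }

    fn into_sorted(self) -> Vec<T> {
        self.heap.into_sorted_vec() // ascending order
    }
}

fn main() {
    let mut tracker = NSmallest::new(3);
    for v in [5, 1, 9, 3, 7, 2] {
        tracker.push(v);
    }
    assert_eq!(tracker.into_sorted(), vec![1, 2, 3]);
}
```

The max-heap orientation is what makes the replacement check O(1): only the worst retained value needs to be compared against each incoming element.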
extension/src/minn_maxn.rs (outdated)

```rust
    }
}

fn min_max_rollup_trans_function<T: Ord + Copy>(
```
Are you going to add rollup tests?
Added to all unit tests.
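Conceptually, rolling up two n-most states amounts to merging their retained values and re-truncating to n. A minimal stand-alone illustration (plain Rust with a hypothetical `combine_n_smallest`, not the extension's actual rollup transition function):

```rust
// Merge two already-reduced n-smallest sets and keep the n smallest
// of the union. Rollup order doesn't matter because the operation is
// associative and commutative.
fn combine_n_smallest(a: &[i64], b: &[i64], n: usize) -> Vec<i64> {
    let mut merged: Vec<i64> = a.iter().chain(b.iter()).copied().collect();
    merged.sort_unstable();
    merged.truncate(n);
    merged
}

fn main() {
    let left = combine_n_smallest(&[1, 4, 9], &[], 3);
    let rolled = combine_n_smallest(&left, &[2, 3, 10], 3);
    assert_eq!(rolled, vec![1, 2, 3]);
}
```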
Force-pushed d268ef9 to 44899d2
I thought of some more questions but I'm not trying to block merging with them :)
```rust
    }
}

#[pg_extern(schema = "toolkit_experimental", immutable, parallel_safe)]
```
Stylistic question: why do we sometimes do this rather than just move all these functions into the toolkit_experimental module? It's always confused me that we use the two different ways to put something into a schema.
You know, I've actually just sort of always followed the existing code and not really thought about it. I had thought that the schema name was needed for functions to make pgx put them in the right space, but it seems to do the right thing if we just put it in the module as well. Maybe this wasn't as reliable in a previous version of pgx?
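For illustration, the two styles being compared look roughly like this (a non-compilable sketch assuming pgx; the function body and signature are elided):

```rust
// Style 1: name the schema explicitly in the attribute.
#[pg_extern(schema = "toolkit_experimental", immutable, parallel_safe)]
fn min_n_final(/* ... */) { /* ... */ }

// Style 2: place the function inside a Rust module; pgx maps the
// module name onto the schema.
mod toolkit_experimental {
    // #[pg_extern(immutable, parallel_safe)]
    // fn min_n_final(/* ... */) { /* ... */ }
}
```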
extension/src/nmost.rs (outdated)

```rust
            .heap
            .peek()
            .expect("Can't be empty in this case");
        let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {
```
Why use std::mem::replace when we don't want the previous value?
The replace was so that we could free the old datum, but I had just left that as a todo.
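As background, `std::mem::replace` writes the new value in place and hands back the old one, which is what lets the caller dispose of the previous datum. A tiny plain-Rust illustration:

```rust
fn main() {
    let mut data = vec![10, 20, 30];
    // replace() swaps in the new value and returns the old one, so the
    // caller can dispose of it. Here it is simply dropped; in the
    // extension, the returned Datum is what would need explicit freeing.
    let old = std::mem::replace(&mut data[1], 99);
    assert_eq!(old, 20);
    assert_eq!(data, [10, 99, 30]);
}
```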
extension/src/nmost.rs (outdated)

```rust
        let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {
            deep_copy_datum(new_element.datum(), new_element.oid())
        });
        // TODO: should we be cleaning up 'old_datum' here? There don't seem to be any convenience functions for this...
```
Is this a potential leak? I don't see where Datum implements Drop...
Well, everything here is done in the aggregate context, so the memory will be reclaimed when the aggregate finishes, which should be fine in almost every case. However, I couldn't quite convince myself that we weren't at risk of running out of memory if we ran this on a very large dataset, sorted such that every new element replaces an old value. So I added a `free_datum` function that I'm now using here.
This adds new aggregates `min_n` and `max_n` for getting the n largest or smallest values from a column. They work with `integer`, `float`, and `timestamptz` values. These functions return an aggregate object which can be combined with other such objects via `rollup`. The data can be extracted from the aggregate via the `into_array` or `into_values` methods, which return either an array of the values or a table containing them.

It further adds `min_n_by` and `max_n_by` functions that take one of the above types plus an associated piece of data (taken as an `AnyElement`, so it can be any type). These behave the same as `min_n`/`max_n` above, but also return the associated data for the smallest or largest elements. `into_array` is not implemented for these aggregates, as it's not clear what that array would look like, and `into_values` requires a value of the appropriate type as an input to allow PostgreSQL to determine the function output (the suggested approach is to just cast a `NULL` to the type of the associated data).
Force-pushed 48bb825 to 9fe5b8f
bors r+
Build succeeded.
Fixes #511