Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding min_n and max_n aggregates #590

Merged
merged 1 commit into from
Oct 29, 2022
Merged

Adding min_n and max_n aggregates #590

merged 1 commit into from
Oct 29, 2022

Conversation

WireBaron
Copy link
Contributor

@WireBaron WireBaron commented Oct 21, 2022

This adds new aggregates min_n and max_n for getting the n largest or smallest values from a column. It will work with integer, float, and timestamptz values. These functions will return an aggregate object which can be combined with other such objects via rollup. The data can be extracted from the aggregate via into_array or into_values methods, which return either an array of the values, or a table containing them.

It further adds min_n_by and max_n_by functions that take one of the above types plus an associated piece of data (takes this as an AnyElement, so can be any type). This will behave the same as the min_n/max_n above, but will also return the associated data for the smallest or largest elements. into_array is not implemented for these aggregates, as it's not clear what that array would look like, and into_values will require a value of the appropriate type as an input to allow postgres to determine the function output (suggested approach is to just cast a NULL to the type of the associated data).

Fixes #511

extension/src/minn_maxn.rs Outdated Show resolved Hide resolved
extension/src/minn_maxn.rs Outdated Show resolved Hide resolved
extension/src/minn_maxn.rs Outdated Show resolved Hide resolved
extension/src/minn_maxn.rs Outdated Show resolved Hide resolved
extension/src/minn_maxn.rs Outdated Show resolved Hide resolved
type FloatMinTransType = NMostTransState<NotNan<f64>>;
type FloatMaxTransType = NMostTransState<Reverse<NotNan<f64>>>;
type TimeMinTransType = NMostTransState<pg_sys::TimestampTz>;
type TimeMaxTransType = NMostTransState<Reverse<pg_sys::TimestampTz>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about keeping this file for the common type and functions and creating one child module file per type? I think it would not only make this file and each individual implementation easier to read, but if kept in the same order, it makes it easy to diff the implementations to spot interesting differences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't say I like it, but I think I dislike it less than any of the alternatives.

use std::collections::BinaryHeap;

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct NMostTransState<T: Ord> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module is named "minn_maxn", functions are named like "min_max_trans_function", but this is named "NMost". Can we rename this to match? I think I like "NMost" better so maybe rename the module and functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with nmost for the generic functions, the specialized types and functions all follow min_int or min_by_int.

}
}

fn min_max_rollup_trans_function<T: Ord + Copy>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to add rollup tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to all unit tests.

Copy link
Contributor

@epgts epgts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of some more questions but I'm not trying to block merging with them :)

}
}

#[pg_extern(schema = "toolkit_experimental", immutable, parallel_safe)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic question: why do we sometimes do this rather than just move all these functions into the toolkit_experimental module? It's always confused me that we use the two different ways to put something into a schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, I've actually just sort of always followed the existing code and not really thought about it. I had thought that the schema name was needed for functions to make pgx put them in the right space, but it seems to do the right thing if we just put it in the module as well. Maybe this wasn't as reliable in a previous version of pgx?

.heap
.peek()
.expect("Can't be empty in this case");
let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use std::mem::replace when we don't want the previous value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replace was so that we could free the old datum, but I had just left that as a todo.

let _old_datum = std::mem::replace(&mut self.data[index_to_replace], unsafe {
deep_copy_datum(new_element.datum(), new_element.oid())
});
// TODO: should we cleaning up 'old_datum' here? There don't seem to be any convenience functions for this...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a potential leak? I don't see where Datum implements Drop...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, everything here is done in the aggregate context, so the memory will be reclaimed with the aggregate finishes, which should be fine in almost every case. However, I couldn't quite convince myself that we weren't at risk of running out of memory if we ran this on a very large dataset, sorted such that every new element replaces an old value. So I added a free_datum function that I'm now using here.

This adds new aggregates min_n and max_n for getting the n largest or smallest
values from a column. It will work with integer, float, and timestamptz values.
These functions will return an aggregate object which can be combined with
other such objects via rollup. The data can be extracted from the aggregate
via into_array or into_values methods, which return either an array of the
values, or a table containing them.

It further adds min_n_by and max_n_by functions that take one of the above
types plus an associated piece of data (takes this as an AnyElement, so can be
any type). This will behave the same as the min_n/max_n above, but will also
return the associated data for the smallest or largest elements. into_array is
not implemented for these aggregates, as it's not clear what that array would
look like, and into_values will require a value of the appropriate type as an
input to allow postgres to determine the function output (suggested approach is
to just cast a NULL to the type of the associated data).
@WireBaron
Copy link
Contributor Author

bors r+

@bors
Copy link
Contributor

bors bot commented Oct 29, 2022

Build succeeded:

@bors bors bot merged commit 6791f1b into main Oct 29, 2022
@bors bors bot deleted the br/min_max_n_by branch October 29, 2022 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

maxn_n / min_n
4 participants