feat(transform): task to apply reduction ops to a container #241
Conversation
Force-pushed from 0e1796a to 3fd9659
I don't see any obvious flaws in this implementation (modulo a few points where I got confused and left comments), but the logic is complicated enough that it would be good to have someone else take a look.
Force-pushed from 9fd3e58 to a4ba5db
I'm going to echo Simon's comments here. This is nice, and I think it could be broadly useful, but I fear it might be too smart for its own good. The actual task we want to do (variance over freq, variance over elevation) could probably be implemented in under 20 lines. This generic version is ~300 lines, and might not be flexible enough for what we want.
One of my fears is that it's pretty hard to specify a sensible generic reduction of the data when you want to support different weightings (unweighted, masked, and weighted) and have everything behave consistently. Some of the common reductions have fairly natural weighted definitions (e.g. mean, variance), but others don't: how do you do a weighted prod, or a weighted min?
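As a concrete illustration of the ambiguity: one candidate for a weighted prod is the weighted geometric mean, but even that doesn't reduce to the ordinary product when every weight is one, so there's no obviously "right" choice.

    import numpy as np

    x = np.array([2.0, 3.0, 4.0])
    w = np.ones_like(x)

    # Weighted geometric mean: one plausible definition of a "weighted prod"
    wprod = np.exp(np.sum(w * np.log(x)) / np.sum(w))

    print(wprod)       # ~2.884, the geometric mean
    print(np.prod(x))  # 24.0, what the unweighted prod actually returns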
I'm not sure how exactly to go about simplifying this. One thing I might suggest is defining a common API, like:
    import numpy as np
    from typing import Literal


    def reduction(
        data: np.ndarray,
        weight: np.ndarray,
        axis: int,
        weighting: Literal["none", "masked", "weighted"],
    ) -> tuple[np.ndarray, np.ndarray]:
        """Perform a reduction of data and weight, returning a reduced data and weight."""
        ...
and then having specific implementations for each of a more limited set of operations that you actually want to support and that are well defined.
Something like:
    def reduction_var(
        data: np.ndarray,
        weight: np.ndarray,
        axis: int,
        weighting: Literal["none", "masked", "weighted"],
    ) -> tuple[np.ndarray, np.ndarray]:
        if weighting == "none":
            v = data.var(axis=axis)
        else:
            if weighting == "masked":
                # Only the mask matters: flatten the weights to 0/1
                w = (weight > 0).astype(weight.dtype)
            else:
                w = weight.copy()
            # Normalise the weights to average one along the reduction axis,
            # so that constant weights reproduce the unweighted variance
            w *= w.shape[axis] / w.sum(axis=axis, keepdims=True)
            v = np.var(w**0.5 * data, axis=axis)
        return v, np.ones_like(v)
The above is a little ugly, so maybe there's a better approach, but I imagine you get the general idea.
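For example, a hypothetical call to the sketch above (shapes made up) would look like:

    rng = np.random.default_rng(0)
    data = rng.standard_normal((4, 8))
    weight = rng.uniform(0.0, 2.0, size=(4, 8))

    # Reduce over axis 1, leaving one variance per row
    v, vw = reduction_var(data, weight, axis=1, weighting="weighted")
    print(v.shape)  # (4,)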
Force-pushed from df01fe1 to d9eaf0f
Force-pushed from 2e64493 to 7da4aed
Force-pushed from ed63c39 to 86dc848
Force-pushed from 2298e91 to 5b9e99f
Force-pushed from 9d31e3d to 5eabac1
Force-pushed from 7076057 to b145a27
Force-pushed from b145a27 to 0e7da70
Force-pushed from 0e7da70 to 715bba4
I think this looks okay. It might be worth adding a comment to _make_output_container pointing out the choice of index_map[0] for the value of the rolled-up axis.
Force-pushed from 715bba4 to 82fdc54
Apply a reduction op across up to n-1 axes of a rank-n dataset. The reduction can be applied to more than one dataset in the container, and axis downselection can be applied to n-1 axes as well. A weight mask can optionally be generated, which masks values where the weights are zero and is broadcast to the dataset shape if possible. This mask can also be applied before doing the reduction, using a masked array.
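As a rough sketch of the mask-then-reduce idea (a minimal illustration with made-up names, not the task's actual code):

    import numpy as np

    data = np.arange(12, dtype=np.float64).reshape(3, 4)
    weight = np.ones_like(data)
    weight[:, 0] = 0.0  # pretend the first channel is flagged

    # Mask values where the weight is zero, then reduce with a masked array
    masked = np.ma.MaskedArray(data, mask=(weight == 0))
    var = masked.var(axis=1).filled(np.nan)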
The task tries to optimize redistribution by first minimizing the number of redistributions that happen, and then choosing the best possible axes to redistribute over if one is needed. Currently the axis choice is just a descending sort by axis length, as sketched below.
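A rough sketch of that axis choice (hypothetical names, not the actual implementation):

    def pick_distributed_axis(shape: tuple[int, ...], reduce_axes: set[int]) -> int:
        # Candidates are the axes we are not reducing over; prefer the
        # longest so the data stays evenly spread across ranks.
        candidates = [ax for ax in range(len(shape)) if ax not in reduce_axes]
        return max(candidates, key=lambda ax: shape[ax])

    pick_distributed_axis((64, 1024, 4096), {0})  # -> 2, the longest remaining axis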
I've tested this a fair bit, but I wouldn't be surprised if there are still some edge cases that could break it. It generally works as expected when used to generate variance-over-freq and variance-over-el datasets in the CHIME daily pipeline.
Requires radiocosmology/caput#242