`GroupKey` should be comparable between DataFrames #2639

ExpandingMan · 2021-03-04T00:15:24Z

The GroupKey objects used as keys of a grouped data frame are distinct from a NamedTuple or other "row-like" objects for the sake of efficiency, but currently they are not comparable across data frames. This is inconsistent with other "row-like" objects, all of which are comparable regardless of their parent.

This forbids some reasonable patterns such as

gs1 = groupby(df1, [:A, :B])
gs2 = groupby(df2, [:A, :B])

for k ∈ keys(gs1)
    # currently this would fail
    f(gs2[k])
end

It also prevents directly creating any other type of dictionary that uses GroupKey as a key.

Now, conversion of the GroupKey is quite simple with NamedTuple, but the question is whether this conversion should be left up to the user.

Note that, if we did change the behavior, some sort of safe but efficient reference check would be needed in getindex(::GroupedDataFrame, ::GroupKey) to ensure that a compatible key is being used and, if not, to perform the proper conversion.
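The NamedTuple conversion mentioned above can be sketched as follows. This is a minimal illustration rather than a recommended pattern: the data frames, their columns `v` and `w`, and the `matches` dictionary are made up for the example, and it assumes DataFrames.jl is installed.

```julia
using DataFrames

# Made-up example data; only the grouping columns A and B matter here.
df1 = DataFrame(A = [1, 1, 2], B = ["x", "x", "y"], v = [10, 20, 30])
df2 = DataFrame(A = [1, 2, 2], B = ["x", "y", "y"], w = [1.0, 2.0, 3.0])

gs1 = groupby(df1, [:A, :B])
gs2 = groupby(df2, [:A, :B])

# For every group key of df1 that also occurs in df2, record the group size in df2.
matches = Dict{NamedTuple, Int}()
for k in keys(gs1)
    nt = NamedTuple(k)       # copy the key's values out of its parent
    if haskey(gs2, nt)       # a NamedTuple lookup does not depend on the key's parent
        matches[nt] = nrow(gs2[nt])
    end
end
```

Because the lookup goes through a NamedTuple, it also works when `df2` lacks some of the groups present in `df1`, at the cost of one conversion per key.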


ExpandingMan · 2021-03-04T00:21:59Z

By the way, it might be worth doing some quick performance tests before seriously considering changing this. While I think the inability to compare GroupKey is rather ugly, the possibility of converting to NamedTuple means this isn't such a big deal. Even though I think it would be nice to change the behavior, I do not think it would be worth it if there were any kind of significant performance barrier.

bkamins · 2021-03-04T08:13:26Z

Performance is not a problem. We can just change the current definition:

function Base.to_index(gd::GroupedDataFrame, key::GroupKey)
    gd === parent(key) && return getfield(key, :idx)
    throw(ErrorException("Cannot use a GroupKey to index a GroupedDataFrame " *
                         "other than the one it was derived from."))
end

to

Base.to_index(gd::GroupedDataFrame, key::GroupKey) =
    gd === parent(key) ? getfield(key, :idx) : Base.to_index(gd, NamedTuple(key))

The only drawback is that it will be slow when mixing data frames (but it will be as fast as it is currently for the same data frame).

Then we would need to define ==, isequal, and hash, and think how haskey and in should work (and make sure we use consistent rules here).

@nalimilan - do you think we should consider these potential changes as breaking?

bkamins · 2021-03-04T08:14:05Z

As a general comment - it should be easy to change once we decide what we want.

nalimilan · 2021-03-04T14:42:49Z

Turning the error into something that works is definitely not breaking. Changing hash and isequal to be comparable across DataFrames can probably be considered non-breaking too (it could break things, but only in very weird use cases). If we do that, we would define hash and isequal to be consistent with NamedTuple, right?

bkamins · 2021-03-04T15:16:51Z

So the logic is:

- if we allow indexing we should allow isequal
- if we allow isequal we have to change hash
- if we change hash we have to do hashing based on contents and not on object identity as we currently do, so defining it in a consistent way with NamedTuple is as good as any other option (we just need to remember that this means that hash will be slow, while now hash is fast because it is doing hashing based on object ID not contents).
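The chain above (equality across parents requires content-based hashing) can be illustrated with a toy type. `ToyKey` and its fields are hypothetical stand-ins invented for this sketch; this is not how GroupKey is implemented in DataFrames.jl.

```julia
# Hypothetical stand-in for a key type; NOT the DataFrames.jl GroupKey.
struct ToyKey
    parent_id::Int       # which (imaginary) parent the key came from
    values::NamedTuple   # the grouping values the key represents
end

# Equality ignores the parent and compares contents only.
Base.:(==)(a::ToyKey, b::ToyKey) = a.values == b.values
Base.isequal(a::ToyKey, b::ToyKey) = isequal(a.values, b.values)

# Keys that are isequal must hash equally, so the hash uses the contents too.
# Delegating to the NamedTuple makes the hash match that of the NamedTuple.
Base.hash(k::ToyKey, h::UInt) = hash(k.values, h)

k1 = ToyKey(1, (A = 1, B = "x"))
k2 = ToyKey(2, (A = 1, B = "x"))   # same contents, different parent

d = Dict(k1 => "found")
result = d[k2]                     # lookup succeeds because hash and isequal agree
```

The price is the one noted above: hashing now walks the key's contents instead of using an object identity.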

bkamins · 2021-03-22T16:26:35Z

@nalimilan - we should then additionally decide how == and isequal should work when mixing GroupKey and other types. I think we should not compare as true to vectors or Tuple, but we can consider comparing as isequal (and consequently ==) to NamedTuple. If we decided to go this way then hash of GroupKey must match hash of NamedTuple (which will be a bit expensive, but acceptable). What do you think?

bkamins · 2021-03-22T16:27:50Z

Then also the question is if we should add isless for consistency with NamedTuple.

xgdgsc · 2023-06-27T08:06:19Z

Is there a way to get the keys as NamedTuples directly, without the NamedTuple.(keys(gdf)) conversion? Something like nkeys()?

bkamins · 2023-06-27T08:08:19Z

It is not clear what you mean. What would be the difference between NamedTuple.(keys(gdf)) and nkeys()?

xgdgsc · 2023-06-27T08:50:40Z

Just a shortcut function.

bkamins · 2023-06-27T13:40:53Z

I think it is not needed often enough, and the alternative is simple enough, so adding it is not needed. @nalimilan - what do you think?

nalimilan · 2023-06-29T20:04:29Z

I agree it doesn't seem worth it. We try to limit the API surface to keep things manageable for users.
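For reference, the broadcast form discussed in the comments above looks like this in use. A minimal sketch with made-up data, assuming DataFrames.jl is installed; `nt_keys` is just an illustrative variable name.

```julia
using DataFrames

# Made-up example data.
df = DataFrame(A = [1, 1, 2], B = ["x", "x", "y"])
gdf = groupby(df, [:A, :B])

# Convert every GroupKey to a NamedTuple in one broadcast call.
nt_keys = NamedTuple.(keys(gdf))
```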

bkamins added the breaking and feature labels Mar 4, 2021

bkamins added this to the 1.0 milestone Mar 4, 2021

bkamins mentioned this issue Mar 4, 2021: Release 1.0 tracking #2640 (Closed)

bkamins mentioned this issue Mar 22, 2021: add in for GroupKeys #2392 (Merged)

This was referenced Mar 22, 2021: DataFrameRow and NamedTuple comparisons #2668 (Closed); add ==, isequal <, and isless for DataFrameRow and GroupKey #2669 (Merged)

bkamins closed this as completed in #2669 Mar 25, 2021
