Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupKey should be comparable between DataFrames #2639

Closed
ExpandingMan opened this issue Mar 4, 2021 · 12 comments · Fixed by #2669
Closed

GroupKey should be comparable between DataFrames #2639

ExpandingMan opened this issue Mar 4, 2021 · 12 comments · Fixed by #2669
Labels
breaking The proposed change is breaking. feature
Milestone

Comments

@ExpandingMan
Copy link
Contributor

The GroupKey grouped dataframe keys are distinct from a NamedTuple or other "row-like" objects for the sake of efficiency, but currently they are not comparable across dataframes. This is not consistent with other "row-like" objects, all of which are comparable regardless of their parent.

This forbids some reasonable patterns such as

gs1 = groupby(df1, [:A, :B])
gs2 = groupby(df2, [:A, :B])

for k  keys(gs1)
    # currently this would fail
    f(gs2[k])
end

As well as the direct creation of any other type of dictionary using GroupKey as a key.

Now, conversion of the GroupKey is quite simple with NamedTuple, but the question is whether this conversion should be left up to the user.

Note that, if we did change the behavior, some sort of safe but efficient reference check would be need on getindex(::GroupedDataFrame, ::GroupKey) to ensure that a compatible key is being used and if not, to perform the proper conversion.

@ExpandingMan
Copy link
Contributor Author

By the way, it might be worth doing some quick performance tests before seriously considering changing this. While I think the inability to compare GroupKey is rather ugly, the possibility of converting to NamedTuple means this isn't such a big deal. Even though I think it would be nice to change the behavior, I do not think it would be worth it if there were any kind of significant performance barrier.

@bkamins
Copy link
Member

bkamins commented Mar 4, 2021

Performance is not a problem. We can just change current:

function Base.to_index(gd::GroupedDataFrame, key::GroupKey)
    gd === parent(key) && return getfield(key, :idx)
    throw(ErrorException("Cannot use a GroupKey to index a GroupedDataFrame " *
                         "other than the one it was derived from."))
end

to

Base.to_index(gd::GroupedDataFrame, key::GroupKey) =
    gd === parent(key) ? return getfield(key, :idx) : Base.to_index(gd, NamedTuple(key))

the only thing is that it will be slow when mixing data frames (but will be as fast as currently for the same data frame).

Then we would need to define ==, isequal, and hash, and think how haskey and in should work (and make sure we use consistent rules here).

@nalimilan - do you think we should consider these potential changes as breaking?

@bkamins bkamins added breaking The proposed change is breaking. feature labels Mar 4, 2021
@bkamins bkamins added this to the 1.0 milestone Mar 4, 2021
@bkamins
Copy link
Member

bkamins commented Mar 4, 2021

As a general comment - it should be easy to change once we decide what we want.

@bkamins bkamins mentioned this issue Mar 4, 2021
19 tasks
@nalimilan
Copy link
Member

Turning the error into something that works is definitely not breaking. Changing hash and isequal to be comparable across DataFrames can probably be considered as non breaking too (it could break things but in very weird use cases). If we do that, we would define hash and isequal to be consistent with NamedTuple, right?

@bkamins
Copy link
Member

bkamins commented Mar 4, 2021

So the logic is:

  1. if we allow indexing we should allow isequal
  2. if we allow isequal we have to change hash
  3. if we change hash we have to do hashing based on contents and not on object identity as we currently do, so defining it in a consistent way with NamedTuple is as good as any other option (we just need to remember that this means that hash will be slow, while now hash is fast because it is doing hashing based on object ID not contents).

@bkamins
Copy link
Member

bkamins commented Mar 22, 2021

@nalimilan - we should then additionally decide how == and isequal should work when mixing GroupKey and other types. I think we should not compare as true to vectors or Tuple, but we can consider comparing as isequal (and consequently ==) to NamedTuple. If we decided to go this way then hash of GroupKey must match hash of NamedTuple (which will be a bit expensive, but acceptable). What do you think?

@bkamins
Copy link
Member

bkamins commented Mar 22, 2021

Then also the question is if we should add isless for consistency with NamedTuple.

@xgdgsc
Copy link
Contributor

xgdgsc commented Jun 27, 2023

Is there a way to get Namedtuple type of keys directly without the conversion of NamedTuple.(keys(gdf))? Something like nkeys()?

@bkamins
Copy link
Member

bkamins commented Jun 27, 2023

It is not clear what you mean. What would be the difference between NamedTuple.(keys(gdf)) and nkeys()?

@xgdgsc
Copy link
Contributor

xgdgsc commented Jun 27, 2023

Just a shortcut function.

@bkamins
Copy link
Member

bkamins commented Jun 27, 2023

I think it is not needed often enough, and the alternative is simple enough, so adding it is not needed. @nalimilan - what do you think?

@nalimilan
Copy link
Member

I agree it doesn't seem worth it. We try to limit the API surface to keeps things manageable for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants