Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix bug handling multi-index empty groupby and add multi-index slicing #1659

Merged
merged 19 commits into from
May 8, 2019

Conversation

ayushdg
Copy link
Member

@ayushdg ayushdg commented May 7, 2019

This pr is a combination of the work here: #1639 + adding the ability to slice multiiindex objects.

This should make the multi index groupby more compatible with dask groupbys on Dask_cudf dataframes.

Thank you @thomcom for the empty Dataframe updates for groupby. I have used most of them as is from your PR

@ayushdg ayushdg marked this pull request as ready for review May 8, 2019 05:22
@ayushdg ayushdg requested a review from a team as a code owner May 8, 2019 05:22
@ayushdg ayushdg changed the title [WIP] Fix bug handling multi-index empty groupby and add multi-index slicing [REVIEW] Fix bug handling multi-index empty groupby and add multi-index slicing May 8, 2019
# unprefix columns
new_cols = []
for c in result.columns:
new_col = c.split('_')[1] # sum_z-> (sum, z)
Copy link
Contributor

@shwina shwina May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps to accommodate column names like person_name, we can do instead:

new_col = c.split('_', 1)[1]

This will split 'sum_person_name' into ['sum', 'person_name']

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! That is my mistake. I'll fix shortly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make that correction in my upcoming PR.


# converting levels to numpy array will produce a Float64Index
# (on empty levels)for levels mimicking the behavior of Pandas
self.levels = np.array(levels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ended up being a significant performance regression. See the fix in #1689. :D

@thomcom thomcom merged commit 17ef8f5 into rapidsai:branch-0.7 May 8, 2019
@thomcom
Copy link
Contributor

thomcom commented May 8, 2019

🔥

@thomcom
Copy link
Contributor

thomcom commented May 8, 2019

Also fixes rapidsai/dask-cudf#231

@ayushdg ayushdg deleted the bug-cudf-mutiindex-groupby branch May 8, 2019 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants