[BUG] Groupby hierarchical aggregations are slow #1697

Closed
vladikir opened this issue May 9, 2019 · 3 comments
Labels
bug Something isn't working Needs Triage Need team to review and classify

Comments


vladikir commented May 9, 2019

Describe the bug
Groupby aggregations over multiple (hierarchical) keys appear to be significantly slower in cuDF than in Pandas.

Steps/Code to reproduce bug
import numpy as np
import pandas as pd
import cudf

pdf = pd.DataFrame({'x': np.random.randint(0, 30, size=3000000),
                    'y': np.random.randint(0, 100, size=3000000),
                    'z': np.random.randint(0, 2000, size=3000000),
                    's': np.random.randint(0, 100, size=3000000)})
gdf = cudf.DataFrame.from_pandas(pdf)
%timeit pdf.groupby(['x', 'y', 'z']).sum()  # around 1.36 s
%timeit gdf.groupby(['x', 'y', 'z']).sum()  # around 8.12 s

Expected behavior
Expected cuDF to be faster than Pandas.

Environment details (please complete the following information):

  • Environment location: Google Colab Cloud

  • Method of cuDF install: conda
    !conda install -q -y --prefix /usr/local -c conda-forge
    -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0
    cudf cuml

  • Please run and attach the output of the cudf/print_env.sh script to gather relevant environment details
    Could not find the script with ! ls / -l -R | grep 'cudf/print_env.sh'; nvidia-smi output is attached instead:
    Thu May 9 15:51:03 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.56       Driver Version: 410.79       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
    | N/A   58C    P0    28W /  70W |   1333MiB / 15079MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

@vladikir vladikir added Needs Triage Need team to review and classify bug Something isn't working labels May 9, 2019
@jrhemstad
Contributor

I believe this is a duplicate of #1685 and is currently being fixed.

@kkraus14
Collaborator

kkraus14 commented May 9, 2019

@vladikir could you try with gdf.groupby(['x', 'y', 'z'], as_index=False).sum() instead just to confirm it's the MultiIndex performance issue noted above?
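
For anyone who wants to run that comparison, here is a minimal sketch of a timing helper. The names `best_time` and `compare_groupby` are hypothetical and not part of cuDF or Pandas. It only assumes the frame passed in exposes `groupby(keys, as_index=...)` followed by `.sum()`, as both calls in this thread do. If the `as_index=False` variant is much faster on the cuDF frame, that would point at the MultiIndex construction rather than the aggregation itself.

```python
import time


def best_time(fn, repeat=3):
    """Return the fastest wall-clock time (in seconds) over `repeat` calls to fn."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_groupby(df, keys=("x", "y", "z"), repeat=3):
    """Time a multi-key groupby sum with and without building an index.

    `df` is any object with a pandas-style groupby, e.g. the `pdf` or `gdf`
    frames from the reproducer above.
    """
    keys = list(keys)
    return {
        # Default behaviour: the grouping keys become a MultiIndex.
        "multiindex": best_time(lambda: df.groupby(keys).sum(), repeat),
        # Keys stay as ordinary columns, so no MultiIndex is built.
        "as_index=False": best_time(
            lambda: df.groupby(keys, as_index=False).sum(), repeat
        ),
    }


# Example usage with the frames from the reproducer (not run here):
#   print(compare_groupby(pdf))
#   print(compare_groupby(gdf))
```

Taking the minimum over a few repeats keeps one-off warm-up costs out of the reported figure; `%timeit` in a notebook gives a more careful measurement of the same two calls.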

@kkraus14
Collaborator

Closing as duplicate of #1685
