[BUG] groupby() super slow on branch-0.7 #1685
Yep, full repro in #1329.
I suspect this is related to #1673 as well.
I'll go back and compare some results with branch-0.6. Floats are also stupendously slow; ints appear to be v. fast:

[timing screenshot omitted]
It may be related to multi-column groupby(). I tested single-column and two-column groupby():

```python
df2 = df[['cat1','cnt']]
df3 = df[['cat1','cat','cnt']]
%timeit df2 = df2.groupby(['cat1']).agg({'cnt': 'count'})
%time df3 = df3.groupby(['cat1','cat']).agg({'cnt': 'count'})
```

[timing output omitted]
I bet you'll have bad numbers on …
@lmeyerov the performance is acceptable on …
Mike, I think you just showed the reverse: 0.7 shows a 5X slowdown. The more columns the df has, even if unused, the slower it goes.
Hi @thomcom, I tried 0.7 again. It improves performance on a small dataset, but the larger dataset is still slow. And when I print the df, it raises an error.

```python
# branch-0.7
>>> len(df)
62914500
>>> %time df1 = df.groupby(['cat1','cat','id','doc_length'], method='hash').count()
CPU times: user 17.7 s, sys: 12.3 s, total: 30 s
Wall time: 30.3 s
>>> print(df1)
/conda/lib/python3.7/site-packages/cudf-0.7.0.dev0+1567.gffac3e1.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/dataframe.py in __str__(self)
    448         nrows = settings.formatting.get('nrows') or 10
    449         ncols = settings.formatting.get('ncols') or 8
--> 450         return self.to_string(nrows=nrows, ncols=ncols)
    451
    452     def __repr__(self):

/conda/lib/python3.7/site-packages/cudf-0.7.0.dev0+1567.gffac3e1.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/dataframe.py in to_string(self, nrows, ncols)
    405         if isinstance(self.index, cudf.dataframe.multiindex.MultiIndex) or\
    406                 isinstance(self.columns, cudf.dataframe.multiindex.MultiIndex):
--> 407             raise TypeError("You're trying to print a DataFrame that contains "
    408                             "a MultiIndex. Print this dataframe with "
    409                             ".to_pandas()")

TypeError: You're trying to print a DataFrame that contains a MultiIndex. Print this dataframe with .to_pandas()
```

```python
# branch-0.6
>>> len(df)
62914500
>>> %time df1 = df.groupby(['cat1','cat','id','doc_length'], method='hash').count()
CPU times: user 584 ms, sys: 236 ms, total: 820 ms
Wall time: 857 ms
>>> print(df1)
      cat  id  doc_length  cnt
cat1
0       0   1           1   59
0       0   2           4   56
0       0   3           2   58
0       0   4           1   59
0       0   5          10   50
0       0   6          13   47
0       0   7          10   50
0       0   8          10   50
0       0   9          10   50
0       0  10          10   50
[8769167 more rows]
```
@MikeChenfu Can you try passing …
Thanks @kkraus14. The performance is great: it is about 1s now.

@MikeChenfu glad to hear the workaround worked well for you. I'm going to keep this open until we resolve the performance issues in using a MultiIndex.
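For anyone following along, here is a hedged pandas sketch of the semantics involved (cudf mirrors the pandas groupby API here; the tiny DataFrame is made up for illustration). A multi-key groupby puts the group keys into a MultiIndex on the result, which is what the print error above tripped over, while `as_index=False` keeps the keys as ordinary columns:

```python
import pandas as pd

# Toy frame standing in for the real data (made up for illustration).
df = pd.DataFrame({'cat1': [0, 0, 1, 1],
                   'cat':  [0, 1, 1, 1],
                   'cnt':  [1, 2, 3, 4]})

# Default: the group keys become a MultiIndex on the result.
g = df.groupby(['cat1', 'cat']).agg({'cnt': 'count'})
print(type(g.index).__name__)   # MultiIndex

# as_index=False keeps the keys as plain columns instead of an index.
flat = df.groupby(['cat1', 'cat'], as_index=False).agg({'cnt': 'count'})
print(list(flat.columns))       # ['cat1', 'cat', 'cnt']
```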
Added some print statements to the Cython and found the time is actually in the C++ libcudf call.

Given there's a major refactor of groupby coming down the pipes, it doesn't make sense to investigate the perf issue in that case until after the refactor has landed.
This should have been autoresolved by #1702 |
It shows a different performance on the latest master.
I tested different dataset sizes to show the problem above.

```python
In [1]: from cudf import DataFrame
   ...: a = DataFrame({'b':[1,2,3,4,5,6,7,8] * 100000, 'd':range(800000),'c':range(800000),'cnt':range(800000)})
   ...: %time a1 = a.groupby(['b','d','c'],as_index=False).count()
   ...:
   ...: a2 = a1[['b','cnt']]
   ...: %time a3 = a2.groupby(['b'],as_index=False).count()
CPU times: user 476 ms, sys: 24 ms, total: 500 ms
Wall time: 525 ms
CPU times: user 1.1 s, sys: 788 ms, total: 1.89 s
Wall time: 1.89 s

In [2]: from cudf import DataFrame
   ...: a = DataFrame({'b':[1,2,3,4,5,6,7,8] * 1000000, 'd':range(8000000),'c':range(8000000),'cnt':range(8000000)})
   ...:
   ...: %time a1 = a.groupby(['b','d','c'],as_index=False).count()
   ...:
   ...: a2 = a1[['b','cnt']]
   ...: %time a3 = a2.groupby(['b'],as_index=False).count()
CPU times: user 76 ms, sys: 112 ms, total: 188 ms
Wall time: 230 ms
CPU times: user 48.5 s, sys: 35.8 s, total: 1min 24s
Wall time: 1min 24s
```
@MikeChenfu We've identified a performance issue specifically with the …
```python
In [3]: from cudf import DataFrame
   ...: a = DataFrame({'b':[1,2,3,4,5,6,7,8] * 1000000, 'd':range(8000000),'c':range(8000000),'cnt':range(8000000)})
   ...:
   ...: %time a1 = a.groupby(['b','d','c'],as_index=False).sum()
   ...:
   ...: a2 = a1[['b','cnt']]
   ...: %time a3 = a2.groupby(['b'],as_index=False).sum()
CPU times: user 176 ms, sys: 160 ms, total: 336 ms
Wall time: 398 ms
CPU times: user 1min 22s, sys: 1min, total: 2min 22s
Wall time: 2min 23s
```
Thanks for the reproducer, I can confirm I can reproduce.
After a fun game of …, this is the repro I've been using: …

When timing with the Linux …

CC @harrism
I think the cause is pretty clear: CUDA doesn't support atomicAdd natively on signed 64-bit ints, only unsigned. The groupby above resolves to building a hash table with key-type …. Before commit 542d960, there was this overload in cuDF, which just casts signed to unsigned to do the atomicAdd: …

But now, we go through this path: …

This ultimately calls an atomicCAS-based implementation, which may loop and result in a lot of divergence. It's no wonder the performance is so bad. What I don't understand is why 64-bit types are used at all here, when the Python code above sets all types to 'int32'. @kkraus14, any idea why that is?
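To make the retry behavior concrete, here is a hedged, single-threaded Python sketch of the general CAS-based atomicAdd pattern (not the actual libcudf code; `cas` and `atomic_add_via_cas` are made-up names). Each thread reads the old value and retries the compare-and-swap until no other thread has raced it, and it is those retries that cause divergence under contention:

```python
# Hypothetical sketch of emulating atomicAdd with compare-and-swap (CAS).
# On a GPU, the while-loop below retries whenever another thread wins the race.

def cas(memory, addr, expected, new):
    """Compare-and-swap: write `new` only if the cell still holds `expected`.
    Returns the value observed before the attempt."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old

def atomic_add_via_cas(memory, addr, val):
    """Emulate atomicAdd with a CAS retry loop; returns the previous value."""
    while True:
        assumed = memory[addr]
        observed = cas(memory, addr, assumed, assumed + val)
        if observed == assumed:   # nobody raced us; the add took effect
            return observed
        # otherwise another thread changed the cell first: loop and retry

memory = [0] * 4
atomic_add_via_cas(memory, 0, 5)
atomic_add_via_cas(memory, 0, 7)
print(memory[0])  # 12
```

A hardware atomicAdd completes in one instruction, while this pattern can loop arbitrarily many times when many threads hit the same cell, which matches the slowdown described above.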
I have a tentative fix, which is to change this: …

To this: …

This change reduces the bottleneck kernel time from 17s to 5.4ms (!!!). However, I'm concerned about the safety of casting a signed to an unsigned value.
Because …
Can count on a table with …
OK, here's a summary: …
I feel it is safe to cast a signed to an unsigned value, for integer summation only, as long as two's complement is used for the signed number representation.
@kovaltan is right, I had forgotten about the equivalence of signed and unsigned arithmetic with two's complement. I'm pretty sure that NVIDIA GPUs use two's complement; otherwise compatibility of code with the CPU would be quite difficult. :) So this means we can fix the slow sum() aggregations for signed int64 too.
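A quick way to convince yourself of that equivalence is a small sketch in plain Python, simulating 64-bit wraparound with masking (`to_u64`/`to_s64` are hypothetical helper names, not anything from cuDF). Adding two values as unsigned 64-bit integers and reinterpreting the result as signed gives the same bits as signed two's-complement addition:

```python
# Simulate 64-bit two's-complement arithmetic with unbounded Python ints.
# to_u64/to_s64 are made-up helpers for this illustration.
MASK64 = (1 << 64) - 1

def to_u64(x):
    """Reinterpret a signed 64-bit value as unsigned (bit pattern unchanged)."""
    return x & MASK64

def to_s64(x):
    """Reinterpret an unsigned 64-bit value as signed two's complement."""
    x &= MASK64
    return x - (1 << 64) if x >= (1 << 63) else x

# Signed addition vs. "cast to unsigned, add, cast back": identical results,
# including cases with negative operands.
for a, b in [(-5, 3), (123456789, -987654321), (-(1 << 62), -(1 << 62))]:
    assert to_s64(to_u64(a) + to_u64(b)) == to_s64(a + b)
print("signed and unsigned 64-bit addition agree bit-for-bit")
```

This is why the cast-to-unsigned trick for atomicAdd is safe for sums on two's-complement hardware: only the interpretation of the bit pattern differs, not the addition itself.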
Now, PR #1691 supports signed int64_t native …
@MikeChenfu this issue should now be resolved on https://github.com/rapidsai/cudf/tree/release-0.7.2. Closing the issue. Please reopen if you're still seeing the same performance issues.
Thanks @jrhemstad. Is it available on 0.8?
The fix was a hotfix to the 0.7 release. I believe it should already be available.
I see from the CHANGELOG.md that this PR may not be ready on 0.8.
Once 0.7.2 is released (should be in the next 24 hours) and merged into master, we'll merge master into …
I found that groupby() is super slow on branch-0.7 compared to branch-0.6. The time output is attached.

[time outputs omitted]