
[FEA] Add support for org.apache.spark.sql.execution.SampleExec #3419

Closed
viadea opened this issue Sep 9, 2021 · 4 comments · Fixed by #3789
Labels
feature request New feature or request

Comments


viadea commented Sep 9, 2021

Is your feature request related to a problem? Please describe.

I wish the RAPIDS Accelerator for Apache Spark would support org.apache.spark.sql.execution.SampleExec.

Describe the solution you'd like

The API below should work on the GPU:

df1.sample(0.1).collect

Currently it falls back to the CPU with the following driver log:

!NOT_FOUND <SampleExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.SampleExec could be found


@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 9, 2021

revans2 commented Sep 9, 2021

@viadea Do we need to match Spark's random number sampling 100%? cuDF has a sample API that supports sampling with and without replacement, but it has a number of issues that would make it hard to use in this case and would also keep it from matching Spark 100%.

If we need to be 100% accurate, then the only practical way to do this is to use the existing Spark implementations of BernoulliCellSampler and PoissonSampler, but pass in a range from 0 to num_rows for the batch. The output would then be sent to the GPU as a gather map. This will likely not be very fast, as we have to generate the data the same way Spark would and then send it to the GPU, but if we want compatibility this is about the only way to do it. This is also what we do for random number generation.
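The Bernoulli case of that CPU-side approach can be sketched as follows. This is an illustrative Python sketch of the idea only, not plugin or Spark code: the function name is hypothetical, and Python's `random` stands in for Spark's BernoulliCellSampler. A seeded coin flip per row over the range 0 to num_rows yields the list of kept row indices, which is the gather map that would be sent to the GPU.

```python
import random

def bernoulli_gather_map(num_rows, fraction, seed):
    """Build a gather map (indices of rows to keep) by flipping a
    seeded coin per row on the CPU, Bernoulli-style. The resulting
    index list would be sent to the GPU to gather the sampled rows
    from the columnar batch."""
    rng = random.Random(seed)
    return [i for i in range(num_rows) if rng.random() < fraction]

# Same seed -> same gather map, so the sample is reproducible.
gm1 = bernoulli_gather_map(1000, 0.1, seed=42)
gm2 = bernoulli_gather_map(1000, 0.1, seed=42)
assert gm1 == gm2
```

Because only the indices are generated on the CPU, the (potentially wide) row data never has to leave the GPU; the gather itself stays columnar.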

@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Sep 14, 2021
@res-life res-life self-assigned this Sep 27, 2021

res-life commented Sep 28, 2021

@viadea @revans2 On a Spark CPU environment the results vary on each run; the sample is genuinely random, so I think it's unnecessary to match 100%. What do you think?

The results are the same if you specify the seed parameter, e.g. df.sample(0.1, 1), where the second parameter is the seed. Given that, is it still necessary to match 100%?


revans2 commented Sep 28, 2021

https://github.com/apache/spark/blob/8c51fd85d6e2e84795c03e9e3e1673f62243456a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2169-L2235

https://github.com/apache/spark/blob/8c51fd85d6e2e84795c03e9e3e1673f62243456a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L308-L313

The sample API takes a few different parameters: a fraction, with-replacement, and a seed. If you do not provide a seed, then a random seed is used, so you end up with different results each time. This is always true for the TABLESAMPLE SQL function, which has no way to provide a seed. If you provide a seed, are the results different each time? From what I can see, unless the shuffle ordering is different the result should be the same. Ideally we would want to match 100% if we can guarantee that the input also matches.
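The seed semantics above can be illustrated with a small sketch (plain Python standing in for Spark's sampler; the function name is illustrative, not Spark's): with a fixed seed, the per-row draws are deterministic, so which *positions* are kept depends only on the seed and fraction, and the result can only change if the rows arrive in a different order.

```python
import random

def kept_positions(num_rows, fraction, seed):
    # With a fixed seed, the per-row random draws are deterministic,
    # so the kept positions depend only on (num_rows, fraction, seed),
    # not on the data stored in the rows.
    rng = random.Random(seed)
    return [i for i in range(num_rows) if rng.random() < fraction]

rows = list(range(50))
pos = kept_positions(len(rows), 0.3, seed=7)

# Same seed, same row order -> the identical sample every time.
assert [rows[i] for i in pos] == [rows[i] for i in kept_positions(50, 0.3, 7)]

# The sampled *values* are rows[i] for each kept position, so if a
# shuffle delivers the rows in a different order, a different set of
# values is selected even though the seed (and positions) are unchanged.
```

This is why matching Spark 100% is only meaningful when the input row ordering is also guaranteed to match.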


viadea commented Oct 29, 2021

Confirmed this using the 20211029 snapshot of the 21.12 jars:

scala> df.sample(0.1).explain
== Physical Plan ==
GpuColumnarToRowTransition false
+- GpuSample 0.0, 0.1, false, -8147524907698676683
   +- GpuFileGpuScan parquet
