
[FEA] Add support for org.apache.spark.sql.execution.SampleExec #3419

Closed
viadea opened this issue Sep 9, 2021 · 4 comments · Fixed by #3789
Labels
feature request New feature or request

Comments


viadea commented Sep 9, 2021

Is your feature request related to a problem? Please describe.

I wish the RAPIDS Accelerator for Apache Spark would support org.apache.spark.sql.execution.SampleExec.

Describe the solution you'd like

The API below should work on the GPU:

df1.sample(0.1).collect

Currently it falls back to the CPU with the following driver log:

!NOT_FOUND <SampleExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.SampleExec could be found


@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 9, 2021

revans2 commented Sep 9, 2021

@viadea Do we need to match Spark's random number sampling 100%? cuDF has a sample API that supports sampling with and without replacement, but it has a number of issues that would make it hard to use in this case and would also keep it from matching Spark 100%.

If we need to be 100% accurate, then the only practical way to do this is to use the existing Spark implementations of BernoulliCellSampler and PoissonSampler, but pass in a range from 0 to num_rows for the batch. The output would then be sent to the GPU as a gather map. This will likely not be very fast, as we have to generate the data the same way Spark would and then send it to the GPU, but if we want compatibility this is about the only way to do it. This is also what we do for random number generation.
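The Bernoulli case of that CPU-side approach can be sketched as follows. This is an illustrative Python sketch of the idea only, not plugin or Spark code: the function name is hypothetical, and Python's `random` stands in for Spark's BernoulliCellSampler. A seeded coin flip per row over the range 0 to num_rows yields the list of kept row indices, which is the gather map that would be sent to the GPU.

```python
import random

def bernoulli_gather_map(num_rows, fraction, seed):
    """Build a gather map (indices of rows to keep) by flipping a
    seeded coin per row on the CPU, Bernoulli-style. The resulting
    index list would be sent to the GPU to gather the sampled rows
    from the columnar batch."""
    rng = random.Random(seed)
    return [i for i in range(num_rows) if rng.random() < fraction]

# Same seed -> same gather map, so the sample is reproducible.
gm1 = bernoulli_gather_map(1000, 0.1, seed=42)
gm2 = bernoulli_gather_map(1000, 0.1, seed=42)
assert gm1 == gm2
```

Because only the indices are generated on the CPU, the (potentially wide) row data never has to leave the GPU; the gather itself stays columnar.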

@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Sep 14, 2021
@res-life res-life self-assigned this Sep 27, 2021

res-life commented Sep 28, 2021

@viadea @revans2 On a Spark CPU environment the results vary on each run; the sample is genuinely random, so I think it's unnecessary to match 100%. What do you think?

The results are the same if you specify the seed parameter, e.g. df.sample(0.1, 1), where the second parameter is the seed. Given that, is it still necessary to match 100%?


revans2 commented Sep 28, 2021

https://github.com/apache/spark/blob/8c51fd85d6e2e84795c03e9e3e1673f62243456a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2169-L2235

https://github.com/apache/spark/blob/8c51fd85d6e2e84795c03e9e3e1673f62243456a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L308-L313

The sample API takes a few different parameters: a fraction, with-replacement, and a seed. If you do not provide a seed, then a random seed is used, so you end up with different results each time. This is always true for the TABLESAMPLE SQL function, which has no way to provide a seed. If you provide a seed, are the results different each time? From what I can see, unless the shuffle ordering is different the result should be the same. Ideally we would want to match 100% if we can guarantee that the input also matches.
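The seed semantics above can be illustrated with a small sketch (plain Python standing in for Spark's sampler; the function name is illustrative, not Spark's): with a fixed seed, the per-row draws are deterministic, so which *positions* are kept depends only on the seed and fraction, and the result can only change if the rows arrive in a different order.

```python
import random

def kept_positions(num_rows, fraction, seed):
    # With a fixed seed, the per-row random draws are deterministic,
    # so the kept positions depend only on (num_rows, fraction, seed),
    # not on the data stored in the rows.
    rng = random.Random(seed)
    return [i for i in range(num_rows) if rng.random() < fraction]

rows = list(range(50))
pos = kept_positions(len(rows), 0.3, seed=7)

# Same seed, same row order -> the identical sample every time.
assert [rows[i] for i in pos] == [rows[i] for i in kept_positions(50, 0.3, 7)]

# The sampled *values* are rows[i] for each kept position, so if a
# shuffle delivers the rows in a different order, a different set of
# values is selected even though the seed (and positions) are unchanged.
```

This is why matching Spark 100% is only meaningful when the input row ordering is also guaranteed to match.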


viadea commented Oct 29, 2021

Confirmed this using the 20211029 snapshot of the 21.12 jars:

scala> df.sample(0.1).explain
== Physical Plan ==
GpuColumnarToRowTransition false
+- GpuSample 0.0, 0.1, false, -8147524907698676683
   +- GpuFileGpuScan parquet
