
[FEA] Support org.apache.spark.sql.catalyst.expressions.ReplicateRows #4104

Closed
viadea opened this issue Nov 15, 2021 · 1 comment · Fixed by #4388
Labels: feature request (New feature or request), P1 (Nice to have for release)

Comments

viadea (Collaborator) commented Nov 15, 2021

Is your feature request related to a problem? Please describe.

This is a feature request to support org.apache.spark.sql.catalyst.expressions.ReplicateRows.
ReplicateRows is an internal expression that the optimizer uses to rewrite EXCEPT ALL and INTERSECT ALL queries, per the Spark source code.
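
For context, a rough plain-Scala illustration of the semantics (not Spark code; the values are made up): the expression takes a row count followed by the output columns and emits each input row that many times.

// Illustration only: ReplicateRows(count, cols...) emits each row `count` times.
// In the EXCEPT ALL rewrite, `count` is the per-group surplus of left-side rows.
val grouped = Seq((2L, "M"), (1L, "F"))  // (sum, gender) after the rewrite's aggregation
val replicated = grouped.flatMap { case (n, g) => Seq.fill(n.toInt)(g) }
// replicated == Seq("M", "M", "F")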

Below is a mini repro:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000.1, "2019-01-01",List("Java","Scala")),
    Row(Row("Bob ","Middle","Green"),"2","M",2000.2, "2019-01-02",List("Java","Python")),
    Row(Row("Cathy ","","Green"),"3","F",3000.3, "2019-01-03",List())
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",DoubleType)
  .add("birthdayStr",StringType)
  .add("language",ArrayType(StringType))
             )

// Write the data as Parquet and read it back so the query runs against a file-based table.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.withColumn("birthday", to_date(col("birthdayStr"))).write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
df2.printSchema

// EXCEPT ALL causes the optimizer to introduce ReplicateRows in the plan.
val querytext = """SELECT gender FROM df2 EXCEPT ALL (SELECT gender FROM df2 WHERE salary <> 10)"""
spark.sql(querytext).explain

The relevant Spark driver log messages:

  !Exec <GenerateExec> cannot run on GPU because not all expressions can be replaced
    !NOT_FOUND <ReplicateRows> replicaterows(sum#99L, gender#76) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ReplicateRows could be found
viadea added the feature request and ? - Needs Triage labels on Nov 15, 2021
Salonijain27 added the P1 (Nice to have for release) label and removed ? - Needs Triage on Nov 16, 2021
revans2 (Collaborator) commented Nov 16, 2021

The APIs already exist in cuDF (TableView.repeat). The hardest part will be memory management and combining it with the existing code, which is rather specific to explode and pos_explode.
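
As a rough sketch of the direction (not the plugin's actual implementation, and assuming the cuDF Java bindings expose Table.repeat(ColumnView), mirroring cudf::repeat on a table_view):

import ai.rapids.cudf.{ColumnVector, Table}

// Sketch only: replicate row i of `input` counts(i) times on the GPU.
// Assumes Table.repeat(ColumnView) exists in the Java bindings; closing the
// returned Table and the count column is the memory-management part the
// plugin has to get right.
def replicateRowsSketch(input: Table, counts: ColumnVector): Table =
  input.repeat(counts)

The count column would be built from the first child of replicaterows (the sum produced by the EXCEPT ALL rewrite), and the remaining children become the columns of the repeated table.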
