
Fix regression from 21.12 where udfs defined in repl no longer worked #5030

Merged
4 commits merged into NVIDIA:branch-22.04 on Mar 25, 2022

Conversation

abellina (Collaborator)

Signed-off-by: Alessandro Bellina abellina@nvidia.com

Closes #5019.

Reverts a localized change to the udf-compiler that was introduced by #3726.

With ShimLoader.loadClass, we could not resolve a UDF defined in the REPL, as documented in #5019.
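As an illustration of the failure mode (a hedged sketch, not the actual udf-compiler change; resolveUdfClass is a hypothetical helper): a class generated on the fly by the spark-shell interpreter exists only in the REPL's class loader, so a by-name lookup through a fixed, plugin-wide loader cannot see it, while the loader that defined the lambda instance always can.

def resolveUdfClass(udfInstance: AnyRef): Class[_] = {
  val name = udfInstance.getClass.getName
  // A plugin-wide lookup such as ShimLoader.loadClass(name) fails here for
  // REPL-defined lambdas: that loader never saw the interpreter's classes.
  // The loader that defined the lambda can always resolve it again:
  udfInstance.getClass.getClassLoader.loadClass(name)
}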

@abellina (Collaborator, Author)

build

@wjxiz1992 (Collaborator)

Should we add some tests for such a case?

@abellina (Collaborator, Author)

Should we add some tests for such a case?

I am not sure this is worth it, and honestly I do not know how to replicate the REPL in a test. I could define a lambda in a different class loader, but I am not sure that would be worthwhile.
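For reference, one way a test could mimic a REPL-defined lambda without spinning up spark-shell is to compile the lambda with scala.tools.reflect.ToolBox, which, like the interpreter, defines its classes in its own class loader. This is only a sketch under that assumption (it needs scala-compiler on the test classpath) and is not a test that exists in the repo:

import scala.reflect.runtime.{universe => ru}
import scala.tools.reflect.ToolBox

object ReplLikeLambdaCheck {
  def main(args: Array[String]): Unit = {
    val tb = ru.runtimeMirror(getClass.getClassLoader).mkToolBox()
    // Compile a lambda at runtime, roughly as the REPL would.
    val f = tb.eval(tb.parse("(x: Boolean) => !x")).asInstanceOf[Boolean => Boolean]
    val name = f.getClass.getName

    // Resolving through the lambda's own class loader succeeds.
    val resolved = f.getClass.getClassLoader.loadClass(name)
    assert(resolved.getName == name)

    // A by-name lookup through the application class loader does not,
    // which is the failure mode reported in #5019.
    try {
      Class.forName(name, false, getClass.getClassLoader)
      assert(false, "expected ClassNotFoundException")
    } catch { case _: ClassNotFoundException => () }
  }
}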

gerashegalov (Collaborator) previously approved these changes on Mar 25, 2022:

LGTM.
It would be great if we could also verify a non-REPL use of a Scala UDF in a Spark job manually; we don't have Scala integration tests.

Commit: …ion.scala
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
@abellina (Collaborator, Author) commented on Mar 25, 2022

@gerashegalov wrote:

It would be great if we could also verify a non-REPL use of a Scala UDF in a Spark job manually; we don't have Scala integration tests.

Sorry it took me a little while. We do have the OpcodeSuite unit tests, which cover all of the available translations (a Spark session is created there). I also created a simple app and submitted it with spark-submit against a standalone cluster:

package com.nvidia.spark.udf

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    import spark.implicits._
    val dataset = List(true, false, true, false).toDS().repartition(1)
    val myudf: Boolean => Boolean = { x => !x }
    val u = udf(myudf)

    // With the UDF compiler disabled (the default), the plan keeps the opaque UDF call.
    var result = dataset.withColumn("new", u('value))
    result.explain(true)
    result.show()

    // Enable the RAPIDS UDF compiler and run the same query again: the plan
    // should now show the UDF translated into a Catalyst expression.
    spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")
    result = dataset.withColumn("new", u('value))
    result.explain(true)

    result.show()
    spark.stop()
  }
}

With the UDF compiler disabled:

== Physical Plan ==
*(1) Project [value#1, UDF(value#1) AS new#8]
+- Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#9]
   +- LocalTableScan [value#1]

With the UDF compiler enabled:

== Physical Plan ==
*(1) Project [value#1, if (NOT value#1) true else false AS new#23]
+- Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#41]
   +- LocalTableScan [value#1]
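For completeness, the scenario the fix actually targets is a UDF defined interactively in spark-shell (#5019). A hedged sketch of that session follows; the launch command, jar name, and plugin config are assumptions rather than details taken from this PR:

// Launched roughly as:
//   spark-shell --jars rapids-4-spark.jar \
//     --conf spark.plugins=com.nvidia.spark.SQLPlugin \
//     --conf spark.rapids.sql.udfCompiler.enabled=true
import org.apache.spark.sql.functions.udf

val notUdf = udf { x: Boolean => !x }          // lambda compiled by the REPL itself
val ds = List(true, false, true, false).toDS().repartition(1)
// Before this fix the udf-compiler could not resolve the REPL-defined class and
// fell back to the opaque UDF; with it, explain(true) should show the translated
// expression (if (NOT value#...) ...) as in the plan above.
ds.withColumn("new", notUdf('value)).explain(true)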

@abellina (Collaborator, Author)

build

abellina merged commit 35d777e into NVIDIA:branch-22.04 on Mar 25, 2022
abellina deleted the udf/fix_repl_issue branch on March 25, 2022 at 23:33
Labels: bug (Something isn't working)
Linked issue: [BUG] udf compiler failed to translate UDF in spark-shell (#5019)