
[SPARK-49249][SPARK-49122] Makes SparkSession.addArtifact work with REPL #48120

Open · wants to merge 5 commits into base: master
Conversation

@xupefei (Contributor, Author) commented on Sep 16, 2024

What changes were proposed in this pull request?

This PR makes the new SparkSession.addArtifact API (added in #47631) work with Spark REPL.

Why are the changes needed?

Because it didn't work before :)

Does this PR introduce any user-facing change?

Yes, the user can add a new artifact in the REPL and use it in the current REPL session.
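The flow this enables might look like the following in a `spark-shell` session. This is a sketch, not code from the PR; the JAR path, class name, and UDF name are illustrative:

```scala
// Hypothetical REPL session; "/tmp/my-udfs.jar" and "com.example.StrLen" are made up.
scala> spark.addArtifact("/tmp/my-udfs.jar")

scala> spark.udf.registerJava("strlen", "com.example.StrLen",
     |   org.apache.spark.sql.types.IntegerType)

scala> spark.sql("SELECT strlen('hello')").show()
```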

How was this patch tested?

Added a new test.

Was this patch authored or co-authored using generative AI tooling?

No.

Comment on lines -148 to +150
- private[sql] def registerJava(name: String, className: String, returnDataType: DataType): Unit = {
+ def registerJava(name: String, className: String, returnDataType: DataType): Unit = {
Contributor Author:

I have to make this method public so I can call it from REPL.

Contributor:

I am not against this. I am trying to understand the user facing consequences though. I'd probably prefer that we add support for Scala UDFs as well. That can be done in a follow-up though.

Contributor:

Can you file a follow-up?

Contributor Author:

Will do.

body match {
case Left(e) =>
sc.listenerBus.post(startEvent)
JobArtifactSet.withActiveJobArtifactState(sparkSession.artifactManager.state) {
Contributor:

Can you check how this interacts with all the stuff we do in Connect to make this work? I feel that we are duplicating code now. cc @vicennial

Contributor:

An FYI to other reviewers: review this file with whitespace changes hidden.

Contributor:

Hmm, with this in the execution code path, we may not need SessionHolder#withSession in a few places and can be cleaned up.
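For context, a helper like `withActiveJobArtifactState` typically scopes a thread-local value around the executed block. A minimal self-contained sketch of that pattern (not Spark's actual implementation, which is more involved):

```scala
// Sketch of the thread-local scoping pattern behind a with*State-style helper.
object ScopedState {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def withActiveState[T](state: String)(body: => T): T = {
    val previous = current.get()
    current.set(Some(state))
    try body
    finally current.set(previous) // restore the prior state, even if body throws
  }

  def active: Option[String] = current.get()
}
```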

Contributor Author:

@vicennial Is there an end-to-end test for this? I made some modifications and want to verify they won't break anything.

Contributor:

@xupefei The ReplE2ESuite has some tests covering the overall client -> artifact -> execution flow.
The Python client package may have some E2E tests as well, but I am not familiar with their current status.

@@ -396,4 +396,41 @@ class ReplSuite extends SparkFunSuite {
Main.sparkContext.stop()
System.clearProperty("spark.driver.port")
}

test("register artifacts via SparkSession.addArtifact") {
Contributor:

Can you use a UDF defined in the REPL? If so, how does this work with a JobArtifactSet? Do we layer the globally defined classpath over the session-specific classpath? (It'd be nice to document this somewhere.)

Contributor Author:

I added one more test, which defines a UDF that initialises an external class added as an artifact.

> how does this work with a JobArtifactSet?

Can you elaborate? AFAIK JobArtifactSet is not involved here, since it is the session's artifact path that is applied when a SparkSession is active.

Classpath: it's the other way around. The session classpath is laid over the global one.
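A sketch of that layering, assuming the session classloader is constructed as a child of the global one (the JAR path and class name below are hypothetical, for illustration only):

```scala
import java.net.{URL, URLClassLoader}

// Session artifacts go into a child loader whose parent is the global classpath,
// so a session sees both its own JARs and everything defined globally.
val globalLoader: ClassLoader = Thread.currentThread().getContextClassLoader
val sessionJars: Array[URL] =
  Array(new URL("file:/tmp/session-artifact.jar")) // hypothetical session JAR
val sessionLoader = new URLClassLoader(sessionJars, globalLoader)

// Resolves from the session JAR if present there, falling back to the parent.
val clazz = sessionLoader.loadClass("com.example.MyHelper") // hypothetical class
```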


@vicennial (Contributor) left a comment:

Oops, I meant to leave a comment (ignore the approval, I haven't gone through the whole PR)

@@ -121,7 +120,7 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry)
*/
private[sql] def registerJavaUDAF(name: String, className: String): Unit = {
try {
- val clazz = Utils.classForName[AnyRef](className)
+ val clazz = session.artifactManager.classloader.loadClass(className)
Contributor:

One follow-up here would be to cache the ArtifactManager classloader. I think we create that thing over and over.

@xupefei (Contributor, Author) commented on Sep 23, 2024:

Agree. We can invalidate the cache when a new JAR is added.
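A sketch of that follow-up idea (not code from this PR): hold one cached loader and rebuild it only when a new JAR arrives, so reads on the hot path stay cheap:

```scala
import java.net.{URL, URLClassLoader}

// Illustrative cache-with-invalidation for an artifact classloader.
class CachedArtifactClassLoader(parent: ClassLoader) {
  @volatile private var jars: Vector[URL] = Vector.empty
  @volatile private var cached: URLClassLoader =
    new URLClassLoader(Array.empty[URL], parent)

  def addJar(url: URL): Unit = synchronized {
    jars = jars :+ url
    // Invalidate: rebuild the loader so the new JAR becomes visible.
    cached = new URLClassLoader(jars.toArray, parent)
  }

  def classloader: URLClassLoader = cached // volatile read, no locking
}
```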

@@ -281,7 +281,10 @@ object SQLExecution extends Logging {
val activeSession = sparkSession
val sc = sparkSession.sparkContext
val localProps = Utils.cloneProperties(sc.getLocalProperties)
val artifactState = JobArtifactSet.getCurrentJobArtifactState.orNull
// `getCurrentJobArtifactState` will return a state only in Spark Connect mode. In non-Connect
Contributor:

I think it should be safe to use the SparkSession's jobArtifactState. They should be the same. cc @vicennial.

@hvanhovell (Contributor) left a comment:

LGTM - Pending @vicennial's sign-off.
