
Force parallel world in Shim caller's classloader #3763

Conversation

@gerashegalov (Collaborator) commented Oct 7, 2021

This PR codes up the approach that was tested on Databricks as an alternative while working on #3756.

Fixes #3704

Signed-off-by: Gera Shegalov <gera@apache.org>

- uprev spark320 breaking DiskManager changes
- use the Serializer instance to find mutable classloader
- make the update logic oblivious to the executor/driver side

Signed-off-by: Gera Shegalov <gera@apache.org>
- making this option the default because it is equivalent to the old flat jar
- making it optional because we are still debugging how we miss the non-default
  classloader in NVIDIA#3704, and it is not the right behavior for addJar with
  userClassPathFirst. However, I think we should generally stop
  documenting --jars as the plugin deploy option

Fixes NVIDIA#3704

Signed-off-by: Gera Shegalov <gera@apache.org>
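
For illustration, here is a minimal sketch of the approach described above: locate the classloader that Spark registered with its serializer (set on both the driver and the executor via Serializer.setDefaultClassLoader) and expose the shim's parallel-world URL to it. The object and method names, and the fallback handling, are assumptions for this sketch, not the exact code merged in this PR.

```scala
import java.net.URL

import org.apache.spark.SparkEnv
import org.apache.spark.serializer.Serializer
import org.apache.spark.util.MutableURLClassLoader

// Illustrative sketch only. Finds the mutable classloader that Spark handed
// to its serializer and adds the shim's parallel-world URL to it, so shim
// classes resolve in the caller's classloader.
object ShimClassLoaderSketch {
  def forceParallelWorld(parallelWorldUrl: URL): Unit = {
    Option(SparkEnv.get).foreach { env =>
      // Serializer.defaultClassLoader is not public API, hence reflection;
      // this is the non-public-API downside discussed later in this thread.
      val field = classOf[Serializer].getDeclaredField("defaultClassLoader")
      field.setAccessible(true)
      field.get(env.serializer).asInstanceOf[Option[ClassLoader]] match {
        case Some(mutable: MutableURLClassLoader) =>
          // MutableURLClassLoader.addURL is public; the same code path works
          // regardless of the executor/driver side.
          mutable.addURL(parallelWorldUrl)
        case _ =>
          // fallback (e.g., the thread context classloader) omitted here
      }
    }
  }
}
```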
@gerashegalov added the bug (Something isn't working) and Spark 3.2+ labels on Oct 7, 2021
@gerashegalov added this to the Oct 4 - Oct 15 milestone on Oct 7, 2021
@gerashegalov self-assigned this on Oct 7, 2021
@gerashegalov (Collaborator, Author) commented:

build

@tgravescs (Collaborator) commented Oct 7, 2021

generally stop documenting --jars as the plugin deploy option

Why would we do this? That is generally how you are supposed to distribute jars on Spark; what is your alternative? This needs more explanation. For instance, on YARN you use --jars to get things into the distributed cache so they get sent to the nodes. Without this, do you expect people to install the jar on every node?

@@ -135,7 +135,15 @@ object ShimLoader extends Logging {
// org/apache/spark/serializer/KryoSerializer.scala#L134

Option(SparkEnv.get)
Collaborator review comment on the diff:

Nit: if changes are made to this file, perhaps update the function name to be more generic.

@gerashegalov (Collaborator, Author) replied:

Why would we do this? That is generally how you are supposed to distribute jars on Spark; what is your alternative? This needs more explanation. For instance, on YARN you use --jars to get things into the distributed cache so they get sent to the nodes. Without this, do you expect people to install the jar on every node?

Good point. My main point is that we should decide which of the multiple ways to deploy plugin jars we recommend and support, and make sure that it works with all the features we provide, such as the shuffle manager, in all Spark deploy modes.
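
To make the trade-off concrete, here is an illustrative sketch of the two deployment styles in question. The configuration keys are standard Spark settings, but the jar paths are placeholders; "spark.jars" is the programmatic equivalent of spark-submit --jars, and on YARN it ships the jar via the distributed cache.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the deployment styles discussed above; paths are
// placeholders, not real install locations.
val spark = SparkSession.builder()
  .appName("rapids-deploy-sketch")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Distributed-cache style: the jar is shipped to the nodes automatically.
  .config("spark.jars", "/path/to/rapids-4-spark.jar")
  // Node-local alternative; requires the jar pre-installed on every node:
  // .config("spark.driver.extraClassPath", "/opt/rapids/rapids-4-spark.jar")
  // .config("spark.executor.extraClassPath", "/opt/rapids/rapids-4-spark.jar")
  .getOrCreate()
```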

@abellina (Collaborator) commented Oct 7, 2021

Taking a look at this patch with UCX

@tgravescs merged commit 16fc3aa into NVIDIA:branch-21.10 on Oct 7, 2021
@gerashegalov (Collaborator, Author) commented:

While talking through the list of pros and cons with @tgravescs, we concluded that my concern about userClassPathFirst is moot, for the following reason: pre-21.10 the whole jar was visible to the ShimLoader's classloader, and we are not changing which classloader Spark chooses to load the shim. Thus, making the right sections of the jar accessible to that same classloader merely makes it equivalent to the pre-21.10 setup.

So maybe having a reflection call into a non-public API is the only known downside at this moment.

@gerashegalov deleted the forceParallelWorldInCallerClassloader branch on October 7, 2021 at 17:47
@abellina (Collaborator) commented Oct 7, 2021

Taking a look at this patch with UCX

It came up fine with and without earlyStart. I tried with the latest Spark 3.2.1.
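
For reference, exercising the plugin with the UCX shuffle path means also installing the RAPIDS shuffle manager, which is exactly the kind of feature the classloader fix has to work with. A hedged sketch of that configuration follows; the shim-qualified RapidsShuffleManager class name for Spark 3.2.1 is an assumption here.

```scala
import org.apache.spark.SparkConf

// Illustrative configuration for the UCX shuffle test described above;
// the shim-qualified shuffle manager class name is an assumption.
val conf = new SparkConf()
  .set("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .set("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark321.RapidsShuffleManager")
```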

Labels: bug (Something isn't working), Spark 3.2+
Development

Successfully merging this pull request may close these issues.

[BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment