
[Bug]: Spark2 runner fails to deserialize PipelineOptions due to NoSuchMethodError #23568

Closed
mosche opened this issue Oct 11, 2022 · 3 comments

mosche (Member) commented Oct 11, 2022

What happened?

Spark 2.4.8 uses a fairly old version of Jackson (2.6.7), while Beam is far ahead at 2.13.3.
When attempting to deserialize PipelineOptions on a Spark worker, deserialization fails with a NoSuchMethodError because Beam attempts to use a newer Jackson API that doesn't exist in the version shipped with Spark.

Note: The Spark 2 runner is already deprecated, so this is likely a NO-FIX and mostly for reference.

java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.type.TypeBindings.emptyBindings()Lcom/fasterxml/jackson/databind/type/TypeBindings;
	at org.apache.beam.sdk.options.PipelineOptionsFactory.createBeanProperty(PipelineOptionsFactory.java:1708)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.computeDeserializerForMethod(PipelineOptionsFactory.java:1732)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.getDeserializerForMethod(PipelineOptionsFactory.java:1782)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.deserializeNode(PipelineOptionsFactory.java:1806)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.getValueFromJson(ProxyInvocationHandler.java:584)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.getValueFromJson(ProxyInvocationHandler.java:579)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:219)
	at com.sun.proxy.$Proxy42.getOptionsId(Unknown Source)
	at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.setRuntimeOptions(ValueProvider.java:247)
	at org.apache.beam.sdk.options.ProxyInvocationHandler$Deserializer.deserialize(ProxyInvocationHandler.java:885)
	at org.apache.beam.sdk.options.ProxyInvocationHandler$Deserializer.deserialize(ProxyInvocationHandler.java:866)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
	at org.apache.beam.runners.core.construction.SerializablePipelineOptions.deserializeFromJson(SerializablePipelineOptions.java:76)
	at org.apache.beam.runners.core.construction.SerializablePipelineOptions.readObject(SerializablePipelineOptions.java:61)
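For reference, a quick way to confirm which jackson-databind version a JVM actually resolves (for example from the driver, or from inside a DoFn running on an executor) is a small diagnostic like the sketch below. This is only illustrative and not part of Beam or the report above:

```java
import com.fasterxml.jackson.databind.cfg.PackageVersion;

import java.security.CodeSource;

// Illustrative diagnostic only: prints the jackson-databind version visible to the
// current classloader, plus the jar it was loaded from. On a Spark 2.4 executor this
// would show 2.6.7 (Spark's bundled Jackson) instead of the 2.13.x expected by Beam.
public class JacksonVersionCheck {
  public static void main(String[] args) {
    System.out.println("jackson-databind version: " + PackageVersion.VERSION);
    CodeSource src = PackageVersion.class.getProtectionDomain().getCodeSource();
    // The code source reveals which jar the class was actually resolved from
    // (may be null in some environments).
    System.out.println("loaded from: " + (src != null ? src.getLocation() : "unknown"));
  }
}
```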

Issue Priority

Priority: 3

Issue Component

Component: runner-spark

mosche (Member, Author) commented Oct 21, 2022

Unfortunately, classpath issues are a common source of trouble with both Beam and Spark.
From Beam's perspective this is a NO-FIX: downgrading Jackson to the old version used by Spark 2.4 is obviously not an option.

Spark offers a workaround to handle this using userClassPathFirst:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.driver.userClassPathFirst | false | (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only. | 1.3.0 |
| spark.executor.userClassPathFirst | false | (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances. | 1.3.0 |

When enabling userClassPathFirst it is critical to remove certain dependencies from the application uber jar, specifically the Spark, Hadoop, Scala, slf4j and log4j dependencies. Otherwise the related classes would be loaded by a separate classloader on the user classpath, making them incompatible with the ones loaded by Spark's system classpath loader, even if the versions match exactly; the sketch below illustrates this.
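This incompatibility is a general property of JVM classloading rather than anything Spark-specific. The sketch below (jar path and class name are hypothetical) shows that the "same" class loaded through two different classloaders yields two distinct, non-assignable runtime types:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo: assumes some-library.jar is on the application classpath AND is
// loaded again through an isolated classloader, mirroring what happens when Spark's
// system classpath and the userClassPathFirst classpath both contain the same classes.
public class ClassLoaderIsolationDemo {
  public static void main(String[] args) throws Exception {
    URL jar = new URL("file:/path/to/some-library.jar"); // hypothetical path
    // Passing null as parent prevents delegation to the application classloader.
    try (URLClassLoader isolated = new URLClassLoader(new URL[] {jar}, null)) {
      Class<?> viaIsolated = isolated.loadClass("com.example.SomeClass"); // hypothetical class
      Class<?> viaAppLoader = Class.forName("com.example.SomeClass");
      // Identical bytes and version, yet two distinct Class objects:
      System.out.println(viaIsolated == viaAppLoader);                // false
      System.out.println(viaIsolated.isAssignableFrom(viaAppLoader)); // false -> ClassCastException territory
    }
  }
}
```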

E.g., if using the maven-shade-plugin, this can be done by adding the following configuration to exclude the respective artifacts (by groupId):

<artifactSet>
  <excludes>
    <exclude>log4j</exclude>
    <exclude>org.slf4j</exclude>
    <exclude>io.dropwizard.metrics</exclude>
    <exclude>org.scala-lang</exclude>
    <exclude>org.scala-lang.modules</exclude>
    <exclude>org.apache.spark</exclude>
    <exclude>org.apache.hadoop</exclude>
  </excludes>
</artifactSet>

Additionally, you might have to explicitly bump the version of some Jackson modules to match the version used by Beam (set ${jackson.version} below to Beam's Jackson version, 2.13.3 at the time of writing):

<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_2.11</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-jaxb-annotations</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-paranamer</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>

Step-by-step example

  1. Generate the WordCount quickstart example
  2. Modify the maven-shade-plugin configuration as mentioned above.
  3. Add the Jackson dependencies mentioned above to the spark-runner profile.
  4. Build the uber jar using the spark-runner profile
    mvn package -Pspark-runner
    
  5. Copy it to your Spark cluster (or shared storage)
  6. Run spark-submit on the cluster
    spark-submit --class org.apache.beam.examples.WordCount \
      --master spark://<SPARK_MASTER>:7077 \
      --deploy-mode cluster \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      <STORAGE_PATH>/word-count-beam-bundled-0.1.jar \
      --runner=SparkRunner \
      --output=<RESULT_PATH>/counts  
    

Spark job-server / PortableRunner

If you depend on the Spark job-server to submit your jobs, there's no obvious way to enable userClassPathFirst.
It can still be done, though: you can set Spark properties via the _JAVA_OPTIONS environment variable.

_JAVA_OPTIONS="-Dspark.executor.userClassPathFirst=true"
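For context, this works because the JVM applies _JAVA_OPTIONS entries as regular system properties, and Spark's SparkConf copies every spark.* system property into the configuration it builds. A minimal sketch (assumes Spark is on the classpath) to verify the property is picked up:

```java
import org.apache.spark.SparkConf;

// Minimal check, not part of the job-server: run with
//   _JAVA_OPTIONS="-Dspark.executor.userClassPathFirst=true"
// to confirm the property reaches the Spark configuration.
public class UserClassPathFirstCheck {
  public static void main(String[] args) {
    // _JAVA_OPTIONS entries are applied by the JVM as ordinary system properties ...
    System.out.println(System.getProperty("spark.executor.userClassPathFirst"));
    // ... and SparkConf(true) loads all system properties starting with "spark.".
    SparkConf conf = new SparkConf(true);
    System.out.println(conf.get("spark.executor.userClassPathFirst", "not set"));
  }
}
```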

Unfortunately, this is not sufficient to successfully run the job-server for Spark 2, likely due to an incompatible user classpath.

mosche closed this as not planned Oct 21, 2022
haoyuche commented Dec 7, 2022

Actually, in my case I changed both FasterXML (Jackson) libraries to runtime scope and it works without needing to set userClassPathFirst.

<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_2.12</artifactId>
  <version>${jackson.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
  </exclusions>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.12.10</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>

mosche (Member, Author) commented Dec 8, 2022

@haoyuche The problem here is that the Jackson version used by Spark 2 is too old for Beam.
What version of Spark & Beam are you using?
