
[Bug]: Spark2 runner fails to deserialize PipelineOptions due to NoSuchMethodError #23568

Closed
mosche opened this issue Oct 11, 2022 · 3 comments

mosche (Member) commented Oct 11, 2022

What happened?

Spark 2.4.8 uses a fairly old version of Jackson (2.6.7), while Beam is far ahead at 2.13.3.
When attempting to deserialize PipelineOptions on a Spark worker, deserialization fails with a NoSuchMethodError because Beam attempts to use a newer Jackson API that doesn't exist in the version shipped with Spark.

Note: The Spark 2 runner is already deprecated, so this is likely a NO-FIX and mostly for reference.

java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.type.TypeBindings.emptyBindings()Lcom/fasterxml/jackson/databind/type/TypeBindings;
	at org.apache.beam.sdk.options.PipelineOptionsFactory.createBeanProperty(PipelineOptionsFactory.java:1708)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.computeDeserializerForMethod(PipelineOptionsFactory.java:1732)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.getDeserializerForMethod(PipelineOptionsFactory.java:1782)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.deserializeNode(PipelineOptionsFactory.java:1806)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.getValueFromJson(ProxyInvocationHandler.java:584)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.getValueFromJson(ProxyInvocationHandler.java:579)
	at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:219)
	at com.sun.proxy.$Proxy42.getOptionsId(Unknown Source)
	at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.setRuntimeOptions(ValueProvider.java:247)
	at org.apache.beam.sdk.options.ProxyInvocationHandler$Deserializer.deserialize(ProxyInvocationHandler.java:885)
	at org.apache.beam.sdk.options.ProxyInvocationHandler$Deserializer.deserialize(ProxyInvocationHandler.java:866)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
	at org.apache.beam.runners.core.construction.SerializablePipelineOptions.deserializeFromJson(SerializablePipelineOptions.java:76)
	at org.apache.beam.runners.core.construction.SerializablePipelineOptions.readObject(SerializablePipelineOptions.java:61)
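For reference, a quick way to confirm which jackson-databind version a JVM actually resolves (for example from the driver, or from inside a DoFn running on an executor) is a small diagnostic like the sketch below. This is only illustrative and not part of Beam or the report above:

```java
import com.fasterxml.jackson.databind.cfg.PackageVersion;

import java.security.CodeSource;

// Illustrative diagnostic only: prints the jackson-databind version visible to the
// current classloader, plus the jar it was loaded from. On a Spark 2.4 executor this
// would show 2.6.7 (Spark's bundled Jackson) instead of the 2.13.x expected by Beam.
public class JacksonVersionCheck {
  public static void main(String[] args) {
    System.out.println("jackson-databind version: " + PackageVersion.VERSION);
    CodeSource src = PackageVersion.class.getProtectionDomain().getCodeSource();
    // The code source reveals which jar the class was actually resolved from
    // (may be null in some environments).
    System.out.println("loaded from: " + (src != null ? src.getLocation() : "unknown"));
  }
}
```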

Issue Priority

Priority: 3

Issue Component

Component: runner-spark

mosche (Member, Author) commented Oct 21, 2022

Unfortunately, classpath issues are a common source of trouble with both Beam and Spark.
From Beam's perspective this is a NO-FIX: downgrading Jackson to the old version used by Spark 2.4 is obviously not an option.

Spark offers a workaround to handle this using userClassPathFirst:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.driver.userClassPathFirst | false | (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only. | 1.3.0 |
| spark.executor.userClassPathFirst | false | (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances. | 1.3.0 |

When enabling userClassPathFirst it is critical to remove certain dependencies from the application uber jar, specifically the Spark, Hadoop, Scala, slf4j and log4j dependencies. Otherwise the related classes would be loaded by a separate classloader on the user classpath, making them incompatible with the ones loaded by Spark's system classpath loader, even if the versions match exactly; the sketch below illustrates this.
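This incompatibility is a general property of JVM classloading rather than anything Spark-specific. The sketch below (jar path and class name are hypothetical) shows that the "same" class loaded through two different classloaders yields two distinct, non-assignable runtime types:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo: assumes some-library.jar is on the application classpath AND is
// loaded again through an isolated classloader, mirroring what happens when Spark's
// system classpath and the userClassPathFirst classpath both contain the same classes.
public class ClassLoaderIsolationDemo {
  public static void main(String[] args) throws Exception {
    URL jar = new URL("file:/path/to/some-library.jar"); // hypothetical path
    // Passing null as parent prevents delegation to the application classloader.
    try (URLClassLoader isolated = new URLClassLoader(new URL[] {jar}, null)) {
      Class<?> viaIsolated = isolated.loadClass("com.example.SomeClass"); // hypothetical class
      Class<?> viaAppLoader = Class.forName("com.example.SomeClass");
      // Identical bytes and version, yet two distinct Class objects:
      System.out.println(viaIsolated == viaAppLoader);                // false
      System.out.println(viaIsolated.isAssignableFrom(viaAppLoader)); // false -> ClassCastException territory
    }
  }
}
```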

E.g., if using the maven-shade-plugin, this can be done by adding the following configuration to exclude the respective artifacts (by groupId):

<artifactSet>
  <excludes>
    <exclude>log4j</exclude>
    <exclude>org.slf4j</exclude>
    <exclude>io.dropwizard.metrics</exclude>
    <exclude>org.scala-lang</exclude>
    <exclude>org.scala-lang.modules</exclude>
    <exclude>org.apache.spark</exclude>
    <exclude>org.apache.hadoop</exclude>
  </excludes>
</artifactSet>

Additionally, you might have to explicitly bump the version of some Jackson modules to match the version used by Beam (set ${jackson.version} below to Beam's Jackson version, 2.13.3 at the time of writing):

<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_2.11</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-jaxb-annotations</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-paranamer</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>

Step-by-step example

  1. Generate the WordCount quickstart example
  2. Modify the maven-shade-plugin configuration as mentioned above.
  3. Add the Jackson dependencies mentioned above to the spark-runner profile.
  4. Build the uber jar using the spark-runner profile
    mvn package -Pspark-runner
    
  5. Copy it to your Spark cluster (or shared storage)
  6. Run spark-submit on the cluster
    spark-submit --class org.apache.beam.examples.WordCount \
      --master spark://<SPARK_MASTER>:7077 \
      --deploy-mode cluster \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      <STORAGE_PATH>/word-count-beam-bundled-0.1.jar \
      --runner=SparkRunner \
      --output=<RESULT_PATH>/counts  
    

Spark job-server / PortableRunner

If you depend on the Spark job-server to submit your jobs, there's no obvious way to enable userClassPathFirst.
It can still be done, though: you can set Spark properties via the _JAVA_OPTIONS environment variable.

_JAVA_OPTIONS="-Dspark.executor.userClassPathFirst=true"
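For context, this works because the JVM applies _JAVA_OPTIONS entries as regular system properties, and Spark's SparkConf copies every spark.* system property into the configuration it builds. A minimal sketch (assumes Spark is on the classpath) to verify the property is picked up:

```java
import org.apache.spark.SparkConf;

// Minimal check, not part of the job-server: run with
//   _JAVA_OPTIONS="-Dspark.executor.userClassPathFirst=true"
// to confirm the property reaches the Spark configuration.
public class UserClassPathFirstCheck {
  public static void main(String[] args) {
    // _JAVA_OPTIONS entries are applied by the JVM as ordinary system properties ...
    System.out.println(System.getProperty("spark.executor.userClassPathFirst"));
    // ... and SparkConf(true) loads all system properties starting with "spark.".
    SparkConf conf = new SparkConf(true);
    System.out.println(conf.get("spark.executor.userClassPathFirst", "not set"));
  }
}
```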

Unfortunately, this is not sufficient to successfully run the job-server for Spark 2, likely due to an incompatible user classpath.

mosche closed this as not planned Oct 21, 2022
haoyuche commented Dec 7, 2022

Actually, in my case I changed both FasterXML (Jackson) libraries to runtime scope and it works without needing to set userClassPathFirst.

<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_2.12</artifactId>
  <version>${jackson.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
  </exclusions>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.12.10</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>${jackson.version}</version>
  <scope>runtime</scope>
</dependency>

mosche (Member, Author) commented Dec 8, 2022

@haoyuche The problem here is that the Jackson version used by Spark 2 is too old for Beam.
What version of Spark & Beam are you using?
