
Maven packages from spark.jars.packages aren't loaded into executors' classpath #59

Open
BenMizrahiPlarium opened this issue Jan 5, 2021 · 6 comments


@BenMizrahiPlarium

Hi

I'm having an issue loading Maven package dependencies while using SparkMagic with this Helm chart for Livy and Spark on Kubernetes.

In the Spark config I set:
spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1

The dependencies are downloaded into /root/.ivy2/jars but are not included in the Spark classpath, and when trying to execute an action I get the following error:

21/01/05 11:22:15 INFO DAGScheduler: ShuffleMapStage 1 (take at :30) failed in 0.197 s due to Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 17, 10.4.187.11, executor 2): java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition

Do you have any suggestions?

Thanks

@jahstreet
Collaborator

Hi @BenMizrahiPlarium, one question to clarify your setup: do you configure spark.jars.packages in code through SparkConf? Or how do you do it?

@BenMizrahiPlarium
Author

Hi @jahstreet,

I did the setup using the config section of the Livy create-session request. In other use cases I believe the jar artifacts downloaded from Maven are shared via HDFS and are available to both the driver and the workers; in this use case they are only available in the driver's local Maven repository.
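For illustration, a create-session request of the kind described here might look like the sketch below. The Livy URL is a placeholder assumption; only the spark.jars.packages entry is taken from this thread.

```python
import json

# Hypothetical sketch of a Livy create-session request that carries
# spark.jars.packages in the "conf" section. The endpoint is a placeholder,
# not the reporter's actual deployment.
LIVY_URL = "http://livy:8998/sessions"

payload = {
    "kind": "pyspark",
    "conf": {
        # Resolved by the driver into /root/.ivy2/jars, per the report above
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
    },
}

# Could then be sent with, e.g.:
#   requests.post(LIVY_URL, data=json.dumps(payload),
#                 headers={"Content-Type": "application/json"})
body = json.dumps(payload)
```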

I see that the artifacts are downloaded, but they are not available in the executors' classpath.

If you have any ideas, that would be very helpful :)

@brenoarosa

brenoarosa commented Jan 18, 2021

Currently running into the same issue.
I'm trying the session parameters below (which work with Spark 2).
I can see in the driver log messages that the package is downloaded from Maven, but it's not passed to the executors.

session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "conf": {
        "spark.jars.packages": "mysql:mysql-connector-java",
    },
}

The jar is also not uploaded to spark.kubernetes.file.upload.path (using an S3 bucket), but I don't know if this is the expected behavior.

Passing an HTTP link to the file in the jars parameter also doesn't work.

session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "jars": ["https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar"],
}
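For completeness, session parameters like the ones above would be submitted to Livy's REST API roughly as follows. This is a sketch only: the host and port are assumptions, and the actual request line is left commented out.

```python
import json

# Illustrative only: the Livy endpoint below is an assumed placeholder.
LIVY_URL = "http://livy:8998/sessions"
HEADERS = {"Content-Type": "application/json"}

session_params = {
    "kind": "pyspark",
    "jars": [
        # Direct HTTP link to the connector jar, as tried above
        "https://repo1.maven.org/maven2/mysql/mysql-connector-java/"
        "5.1.48/mysql-connector-java-5.1.48.jar"
    ],
}

body = json.dumps(session_params)
# resp = requests.post(LIVY_URL, data=body, headers=HEADERS)  # needs `requests`
```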

@BenMizrahiPlarium
Author

BenMizrahiPlarium commented Jan 18, 2021

I really think it’s because Spark has no shared file system between workers and driver.

As a workaround for now, I solved it using gcsfuse: mounting a Google Cloud Storage bucket on the driver and the workers, configuring both local Maven repositories to point to this folder, and adding the folder to the Spark extra classpath.

So finally the bucket is mounted at /etc/google on both the driver and the executors, the Maven jars point to /etc/google/jars, and the Spark config includes:

spark.driver.extraClassPath=/etc/google/jars/*
spark.executor.extraClassPath=/etc/google/jars/*

It works, but my problem with this approach is that it's not temporary: it persists across all sessions and isn't isolated per user.

@brenoarosa

brenoarosa commented Jan 18, 2021

I understood from the documentation that spark.kubernetes.file.upload.path should be used for sharing files between the driver and the executors.
Indeed, this works for the Livy-related jars.

I set my spark.kubernetes.file.upload.path to s3a://somosdigital-datascience/spark3/
These are the Livy logs when creating a new session:

21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/asm-5.0.4.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar...
21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-api-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-rsc-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/minlog-1.3.0.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/netty-all-4.1.47.Final.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/objenesis-2.5.1.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/reflectasm-1.11.3.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/commons-codec-1.9.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar... 

This is the relevant snippet from the driver configmap. I can also see from the driver and executor logs that both of them download the jar files:

spark.kubernetes.file.upload.path=s3a\://somosdigital-datascience/spark3/
spark.jars=s3a\://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar,s3a\://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar,s3a\://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar,s3a\://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar,s3a\://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar

But the jars passed via the jars session parameter or the spark.jars.packages config don't get uploaded.

@BenMizrahiPlarium
Author

BenMizrahiPlarium commented Jan 18, 2021

I don't think that spark.kubernetes.file.upload.path is related to external Maven dependencies; it's related to the Spark application jars uploaded to S3.

Spark downloads Maven dependencies into the local Maven repository and uses them on the classpath.

As far as I can see, the jars are loaded into the driver, and every operation done by the driver works fine. But when the actual task is executed on one of the executors, it fails with a ClassNotFoundException, which means the jar isn't available on the executor's classpath.
