
[BUG] ZSTD version mismatch in integration tests #10589

Closed
parthosa opened this issue Mar 13, 2024 · 5 comments
Labels
bug (Something isn't working) · test (Only impacts tests)

Comments

@parthosa
Collaborator

parthosa commented Mar 13, 2024

Multiple integration tests (test_parquet_append_with_downcast, test_parquet_write_column_name_with_dots, etc.) failed with the following error:

[2024-03-13T19:07:42.384Z] E  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2428.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2428.0 (TID 2996) (rapids-it-dataproc-20-ubuntu18-430-<url> executor 1): java.io.IOException: Decompression error: Version not supported
[2024-03-13T19:07:42.384Z] E  at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:185)
[2024-03-13T19:07:42.384Z] E  at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:137)
@parthosa parthosa added the bug, ? - Needs Triage, build, and test labels and removed the build label Mar 13, 2024
@jlowe
Member

jlowe commented Mar 18, 2024

I dug into this a bit and, unexpectedly, found that the RAPIDS Accelerator is not using ZSTD during these tests. Dataproc 2.0 is running Spark 3.1.x, so the tests avoid trying to use the ZSTD codec in that case. However, Spark itself is trying to use ZSTD for the map statistics during shuffle, and that is what is failing during decode. The RAPIDS Accelerator shouldn't be involved in that code path at all, especially since the RAPIDS shuffle is not configured for these tests.

I tried rolling back to a couple of plugin snapshot versions that were known to pass (one each from 3/10 and 2/28), and both failed in the same way. I ssh'd to the worker nodes to manually verify the classpath was using the intended jar version and not the new one.
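
A check like the following does the same kind of verification. This is only a sketch, not the exact commands used; it assumes the stock Dataproc layout with Spark jars under /usr/lib/spark, and <worker-node> is a placeholder:

ssh <worker-node> 'ls /usr/lib/spark/jars/ | grep -i zstd'            # zstd-jni jar bundled with Spark
ssh <worker-node> 'find / -name "rapids-4-spark_*.jar" 2>/dev/null'   # plugin jar(s) actually present on the node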

@jayadeep-jayaraman

Looks like this is related to SPARK-35199. The workaround provided is to set spark.shuffle.mapStatus.compression.codec to lz4. I believe these issues are fixed starting with Spark 3.2; could you please confirm whether you are observing this issue on Dataproc 2.1?
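
A minimal sketch of applying that workaround, assuming the tests are launched via spark-submit; the application argument below is a placeholder, not the actual test harness invocation:

spark-submit \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  <your-application>

The same key can also be set in spark-defaults.conf; lz4 simply avoids the zstd-jni decode path seen in the stack trace above.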

@mattahrens mattahrens removed the ? - Needs Triage label Mar 26, 2024
@NvTimLiu
Collaborator

NvTimLiu commented Mar 27, 2024

Looks like this is related to SPARK-35199. The workaround provided is to set spark.shuffle.mapStatus.compression.codec to lz4. I believe these issues are fixed starting with Spark 3.2; could you please confirm whether you are observing this issue on Dataproc 2.1?

Tried on the 2.0 version: with --conf spark.shuffle.mapStatus.compression.codec=lz4, all test cases also PASS.

[2024-03-27T07:27:28.304Z] ++ SPARK_SUBMIT_FLAGS='--master yarn --num-executors 1 --executor-memory 10G --conf spark.yarn.tags=jenkins-tim-rapids-it-dataproc-2.0-ubuntu18-5 --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.appMasterEnv.PYSP_TEST_spark_eventLog_enabled=true --conf spark.sql.adaptive.enabled=true --conf spark.task.cpus=1 --conf spark.task.resource.gpu.amount=0.25 --conf spark.executor.cores=4 --conf spark.locality.wait=0 --conf spark.shuffle.mapStatus.compression.codec=lz4'

[2024-03-27T07:27:28.304Z] ++ cd integration_tests


[2024-03-27T07:40:44.009Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[2024-03-27T07:40:44.009Z] - generated xml file: integration_tests/target/run_dir-20240327072734-OhFS/TEST-pytest-1711524454084534245.xml -
[2024-03-27T07:40:44.009Z]  428 passed, 18 skipped, 26631 deselected, 22 xfailed, 2 xpassed, 400 warnings in 770.38s (0:12:50)

@sameerz
Collaborator

sameerz commented Apr 1, 2024

@NvTimLiu as discussed, let's update the Dataproc 2.0 integration tests only to use --conf spark.shuffle.mapStatus.compression.codec=lz4. Once that is done, let's close this issue.
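
A rough sketch of that change in the CI script; DATAPROC_IMAGE_VERSION is a hypothetical variable here and the real Jenkins script may gate on a different value, though SPARK_SUBMIT_FLAGS matches the log above:

# Append the workaround only for the Dataproc 2.0 pipelines
if [[ "$DATAPROC_IMAGE_VERSION" == 2.0* ]]; then
  SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.shuffle.mapStatus.compression.codec=lz4"
fi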

@NvTimLiu
Collaborator

NvTimLiu commented Apr 7, 2024

Our fix, adding --conf spark.shuffle.mapStatus.compression.codec=lz4 to the Dataproc 2.0 integration tests only, has been applied in our Jenkins CI script, and the dataproc-2.0-ubuntu18 CI jobs have also been PASSING for days.


Let's close the issue, thanks.

@NvTimLiu NvTimLiu closed this as completed Apr 7, 2024