
[BUG] The ORC output data of a query is not readable #1550

Closed
wjxiz1992 opened this issue Jan 19, 2021 · 6 comments · Fixed by #2084
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@wjxiz1992
Collaborator

Describe the bug
When reading the ORC output produced by the plugin (using some DataFrame APIs to operate on the ORC data), there's an error:

Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 5 kind LENGTH position: 23 length: 23 range: 0 offset: 2762004 limit: 2762004 range 0 = 0 to 23 uncompressed: 20 to 20
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:1778)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1758)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1500)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2090)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1254)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1289)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:286)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:669)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:216)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Steps/Code to reproduce bug
The output is produced by an LHA query, but I think it's enough to point to where it lives on our egx machines: spark-egx-02:/home/allxu/q0_out_gpu.

$SPARK_HOME/bin/spark-shell

scala> val df = spark.read.orc("q0_out_gpu")
df: org.apache.spark.sql.DataFrame = [lx_id: bigint, lx_name: string ... 6 more fields]

scala> df.write.parquet("q0_convert_parquet")
21/01/20 00:03:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 5 kind LENGTH position: 23 length: 23 range: 0 offset: 2744101 limit: 2744101 range 0 = 0 to 23 uncompressed: 20 to 20
...
...

Expected behavior
No error should be seen.

Environment details (please complete the following information)

  • Environment location: Standalone

Additional context
It's a query from LHA; please reach out to me if more information about it is needed.

wjxiz1992 added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jan 19, 2021
wjxiz1992 changed the title from "[BUG] The ORC output data of a query is not 100% readable" to "[BUG] The ORC output data of a query is not readable" on Jan 19, 2021
@jlowe
Member

jlowe commented Jan 19, 2021

@wjxiz1992 is it possible to produce the same output but in a Parquet file? It would be good to know what data is supposed to be in the corrupted ORC file and see if we can reproduce the bad ORC file when loading the equivalent Parquet file then writing it to an ORC file just with libcudf.
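
Concretely, a sketch of what that would look like in the query job (df stands in for the final LHA query result; the output paths are illustrative):

// Sketch only: write the same GPU query result in both formats from one run.
// `df` stands in for the LHA query's final DataFrame.
df.write.orc("q0_out_gpu")
df.write.parquet("q0_out_gpu_parquet")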

@wjxiz1992
Collaborator Author

wjxiz1992 commented Jan 20, 2021

@wjxiz1992 is it possible to produce the same output but in a Parquet file? It would be good to know what data is supposed to be in the corrupted ORC file and see if we can reproduce the bad ORC file when loading the equivalent Parquet file then writing it to an ORC file just with libcudf.

Yes, I've put the parquet output (also produced by GPU) at spark-egx-02:/home/allxu/q0_out_gpu_parquet.
One more question about "just with libcudf": to me, libcudf is only the Java API for cuDF. Do you mean I should create a Java project and then use those APIs to do the job, so that it's "just with libcudf"?

@jlowe
Member

jlowe commented Jan 20, 2021

By "just with libcudf" I meant isolating the issue by removing Spark from the equation. For example, just using the cudf APIs directly, e.g.: something like this from the Spark shell REPL:

val t = ai.rapids.cudf.Table.readParquet(new java.io.File("/tmp/data.parquet"))
t.writeORC(new java.io.File("/tmp/data.orc"))

and verify that the ORC file can be read by Spark CPU and looks correct relative to the Parquet file.
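
A minimal verification sketch for that last step, assuming the plugin honors spark.rapids.sql.enabled and the two files share a schema (both exceptAll counts should be zero):

// Run on Spark CPU: disable the plugin so the native ORC reader is used.
spark.conf.set("spark.rapids.sql.enabled", "false")
val fromOrc = spark.read.orc("/tmp/data.orc")
val fromParquet = spark.read.parquet("/tmp/data.parquet")
assert(fromOrc.exceptAll(fromParquet).count() == 0)
assert(fromParquet.exceptAll(fromOrc).count() == 0)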

I've checked this, and the Parquet file does not replicate the corrupted ORC file, either when writing with Spark GPU or when using the cudf APIs directly. So either the corruption problem is sensitive to the ordering of the data (the ORC and Parquet files are ordered quite differently), or it's some other issue (e.g. a race condition).

I noticed that one column in particular in the bad ORC file, a string column, is unreadable due to the corruption. The other columns are all readable by Spark CPU; however, the data in another string column isn't completely correct. The first row has corrupted data relative to the Parquet file, but many other rows are correct. So the corruption isn't completely isolated to just the one string column.
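
A hedged sketch of that kind of per-column triage on Spark CPU (the loop is illustrative, not literally what was run; countDistinct is used to force each column's data to actually decode):

// Force each column of the suspect ORC file to be fully read,
// reporting which ones fail to decode.
import org.apache.spark.sql.functions.countDistinct
val df = spark.read.orc("q0_out_gpu")
df.schema.fieldNames.foreach { name =>
  try {
    df.select(countDistinct(name)).collect()
    println(s"$name: readable")
  } catch {
    case e: Exception => println(s"$name: FAILED - ${e.getMessage}")
  }
}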

Does this issue happen every time when the query is run in this cluster?

@wjxiz1992
Collaborator Author

By "just with libcudf" I meant isolating the issue by removing Spark from the equation. For example, just using the cudf APIs directly, e.g.: something like this from the Spark shell REPL:

val t = ai.rapids.cudf.Table.readParquet(new java.io.File("/tmp/data.parquet"))
t.writeORC(new java.io.File("/tmp/data.orc"))

and verify that the ORC file can be read by Spark CPU and looks correct relative to the Parquet file.

I've checked this, and the Parquet file does not replicate the corrupted ORC file, either when writing with Spark GPU or when using the cudf APIs directly. So either the corruption problem is sensitive to the ordering of the data (the ORC and Parquet files are ordered quite differently), or it's some other issue (e.g. a race condition).

I noticed that one column in particular in the bad ORC file, a string column, is unreadable due to the corruption. The other columns are all readable by Spark CPU; however, the data in another string column isn't completely correct. The first row has corrupted data relative to the Parquet file, but many other rows are correct. So the corruption isn't completely isolated to just the one string column.

Does this issue happen every time when the query is run in this cluster?

Thanks for the explanation!
Yes, this issue happens every single time, on both my local PC and an NGC node. Both cases use standalone mode.
The query is run with a config of only 1 executor.

@revans2
Collaborator

revans2 commented Feb 12, 2021

cudf 0.18 has already shipped, so we cannot fix this in the 0.4 release. I am moving this to 0.5 and have filed #1722 to mitigate the issue in the 0.4 release.

@wjxiz1992
Collaborator Author

@revans2 This has been fixed since rapidsai/cudf#7565; I tested with the latest 0.5 plugin jar and the 0.19 cuDF jar.
Shall we update the config doc and turn on the ORC write switch?
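
For reference, a hedged sketch of flipping that switch at the session level, assuming the gate is the plugin's spark.rapids.sql.format.orc.write.enabled config:

// Assumption: the plugin gates GPU ORC writes behind this config,
// which the 0.4 mitigation turned off by default.
spark.conf.set("spark.rapids.sql.format.orc.write.enabled", "true")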

wjxiz1992 added a commit to wjxiz1992/spark-rapids that referenced this issue Apr 6, 2021
Re-enable the orc write since NVIDIA#1550
has been fixed

Signed-off-by: Allen Xu <wjxiz1992@gmail.com>
sameerz added this to the Mar 30 - Apr 9 milestone on Apr 6, 2021
jlowe pushed a commit that referenced this issue Apr 7, 2021
Re-enable the orc write since #1550
has been fixed

Signed-off-by: Allen Xu <wjxiz1992@gmail.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Re-enable the orc write since NVIDIA#1550
has been fixed

Signed-off-by: Allen Xu <wjxiz1992@gmail.com>