
Support ORC write Map column #3821

Merged — 14 commits, Nov 2, 2021

Conversation

res-life
Collaborator

This fixes #3784

Signed-off-by: Chong Gao res_life@163.com

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life res-life marked this pull request as draft October 14, 2021 10:38
@sameerz sameerz added the feature request New feature or request label Oct 14, 2021
@firestarman
Collaborator

firestarman commented Oct 15, 2021

It looks good to me.
But it is quite strange that the map-writing tests pass locally yet always fail in the premerge builds.

FAILED ../../src/main/python/orc_write_test.py::test_write_save_table[hive-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MILLIS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[hive-TIMESTAMP_MILLIS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_save_table[native-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MICROS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[hive-TIMESTAMP_MICROS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_round_trip[native-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_round_trip[hive-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error reading file: file:/tmp/pyspark_tests/248646/ORC_DATA/GPU/part-00000-4b057420-2dda-4252-876e-cd1122379735-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:93)
	at org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.nextKeyValue(SparkOrcNewRecordReader.java:84)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.io.EOFException: Read past end of bit field from bit reader current: 254 current bit index: 8 from byte rle literal used: 1/1 from compressed stream Stream for column 12 kind PRESENT position: 78 length: 78 range: 0 offset: 9325 limit: 9325 range 0 = 0 to 78 uncompressed: 75 to 75
	at org.apache.orc.impl.BitFieldReader.readByte(BitFieldReader.java:39)
	at org.apache.orc.impl.BitFieldReader.next(BitFieldReader.java:45)
	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:274)
	at org.apache.orc.impl.TreeReaderFactory$ShortTreeReader.nextVector(TreeReaderFactory.java:500)
	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2342)
	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1322)
	... 22 more

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Oct 15, 2021

The premerge build passed after I shrank the map (max_length=5), see my second commit, but I do not know why it passed.
I saw the following warning in my local build log; maybe this is the clue:

21/10/15 05:42:50 WARN TaskSetManager: 
Stage 135 contains a task of very large size (32945 KiB). 
The maximum recommended task size is 1000 KiB.

@revans2
Collaborator

revans2 commented Oct 15, 2021

In Python, when data is sent to the tasks with parallelize, it is shipped as part of the serialized task itself. This differs from how Scala does the same kind of thing, so Spark is warning you about sending 33 MiB of data. I think part of the issue is that you are writing them all out in a single invocation of the test, but a length of 5 feels fine to me.
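The point can be seen without Spark: with parallelize, the driver's local rows are pickled into each task, so task size grows with the data. A rough, hypothetical illustration of measuring that serialized size (the row shapes below are invented for the sketch, not the actual generated test data):

```python
import pickle

# Hypothetical rows shaped roughly like Map(String, Integer) test data;
# the real integration tests generated far more, hence the 32945 KiB task.
rows = [{f"key_{i}_{j}": i * j for j in range(5)} for i in range(10_000)]

# sc.parallelize(rows) would ship this pickled payload inside the
# serialized task itself, unlike data read from files by the executors.
payload = pickle.dumps(rows)
print(f"serialized driver-side data: {len(payload) / 1024:.0f} KiB")
```

Shrinking max_length bounds the per-row map size, which keeps the serialized task closer to Spark's recommended 1000 KiB.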

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@pxLi
Collaborator

pxLi commented Oct 19, 2021

build

@res-life please resolve the merge conflict before retriggering the CI, otherwise it won't work.

Chong Gao added 2 commits October 19, 2021 16:16
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@revans2
Collaborator

revans2 commented Oct 19, 2021

Looks good.

Signed-off-by: Chong Gao <res_life@163.com>
revans2
revans2 previously approved these changes Oct 20, 2021
@res-life
Collaborator Author

Test cases failed; blocked by "Can not read the generated ORC file by Spark CPU".

@res-life
Collaborator Author

res-life commented Nov 2, 2021

build

@res-life
Collaborator Author

res-life commented Nov 2, 2021

build

@res-life res-life marked this pull request as ready for review November 2, 2021 07:55
@res-life
Collaborator Author

res-life commented Nov 2, 2021

@revans2 please review, no more changes.
CUDF fixed the blocking issue and the premerge build passed: "Can not read the generated ORC file by Spark CPU"

@revans2 revans2 merged commit 27d0e89 into NVIDIA:branch-21.12 Nov 2, 2021
@res-life res-life deleted the orc-write-map branch April 16, 2022 00:18
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support ORC write Map column(single level)
5 participants