
Support ORC write Map column #3821

Merged — 14 commits, Nov 2, 2021

Conversation

res-life
Collaborator

This fixes #3784

Signed-off-by: Chong Gao res_life@163.com

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life res-life marked this pull request as draft October 14, 2021 10:38
@sameerz sameerz added the feature request New feature or request label Oct 14, 2021
@firestarman
Collaborator

firestarman commented Oct 15, 2021

It looks good to me.
But it is quite strange that the map-writing tests pass locally yet always fail in the premerge builds.

FAILED ../../src/main/python/orc_write_test.py::test_write_save_table[hive-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MILLIS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[hive-TIMESTAMP_MILLIS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_save_table[native-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[native-TIMESTAMP_MICROS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_sql_save_table[hive-TIMESTAMP_MICROS-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_round_trip[native-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
FAILED ../../src/main/python/orc_write_test.py::test_write_round_trip[hive-[Map(String(not_null),String), Map(Boolean(not_null),Boolean), Map(Byte(not_null),Byte), Map(Short(not_null),Short), Map(Integer(not_null),Integer), Map(Long(not_null),Long), Map(Float(not_null),Float), Map(Double(not_null),Double), Map(Timestamp(not_null),Timestamp), Map(Date(not_null),Date)]]
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error reading file: file:/tmp/pyspark_tests/248646/ORC_DATA/GPU/part-00000-4b057420-2dda-4252-876e-cd1122379735-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:93)
	at org.apache.hadoop.hive.ql.io.orc.SparkOrcNewRecordReader.nextKeyValue(SparkOrcNewRecordReader.java:84)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.io.EOFException: Read past end of bit field from bit reader current: 254 current bit index: 8 from byte rle literal used: 1/1 from compressed stream Stream for column 12 kind PRESENT position: 78 length: 78 range: 0 offset: 9325 limit: 9325 range 0 = 0 to 78 uncompressed: 75 to 75
	at org.apache.orc.impl.BitFieldReader.readByte(BitFieldReader.java:39)
	at org.apache.orc.impl.BitFieldReader.next(BitFieldReader.java:45)
	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:274)
	at org.apache.orc.impl.TreeReaderFactory$ShortTreeReader.nextVector(TreeReaderFactory.java:500)
	at org.apache.orc.impl.TreeReaderFactory$MapTreeReader.nextVector(TreeReaderFactory.java:2342)
	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1322)
	... 22 more

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Oct 15, 2021

The premerge build passed after I shrank the map (max_length=5), see my second commit, but I do not know why it passed.
I saw the following warning in my local build log; maybe this is the clue:

21/10/15 05:42:50 WARN TaskSetManager: 
Stage 135 contains a task of very large size (32945 KiB). 
The maximum recommended task size is 1000 KiB.

@revans2
Collaborator

revans2 commented Oct 15, 2021

In Python, when data is sent to the tasks with parallelize, it is shipped as part of the serialized task itself. This differs from how Scala does the same kind of thing, so Spark is warning you about sending 33 MiB of data. I think part of the issue is that you are writing them all out in a single invocation of the test, but a length of 5 feels fine to me.
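The point can be seen without Spark: with parallelize, the driver's local rows are pickled into each task, so task size grows with the data. A rough, hypothetical illustration of measuring that serialized size (the row shapes below are invented for the sketch, not the actual generated test data):

```python
import pickle

# Hypothetical rows shaped roughly like Map(String, Integer) test data;
# the real integration tests generated far more, hence the 32945 KiB task.
rows = [{f"key_{i}_{j}": i * j for j in range(5)} for i in range(10_000)]

# sc.parallelize(rows) would ship this pickled payload inside the
# serialized task itself, unlike data read from files by the executors.
payload = pickle.dumps(rows)
print(f"serialized driver-side data: {len(payload) / 1024:.0f} KiB")
```

Shrinking max_length bounds the per-row map size, which keeps the serialized task closer to Spark's recommended 1000 KiB.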

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@pxLi
Collaborator

pxLi commented Oct 19, 2021

build

@res-life please resolve the merge conflict before retriggering the CI, otherwise it won't work.

Chong Gao added 2 commits October 19, 2021 16:16
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@revans2
Collaborator

revans2 commented Oct 19, 2021

Looks good.

Signed-off-by: Chong Gao <res_life@163.com>
revans2
revans2 previously approved these changes Oct 20, 2021
@res-life
Collaborator Author

Test cases failed; blocked by "Can not read the generated ORC file by Spark CPU".

@res-life
Collaborator Author

res-life commented Nov 2, 2021

build

@res-life
Collaborator Author

res-life commented Nov 2, 2021

build

@res-life res-life marked this pull request as ready for review November 2, 2021 07:55
@res-life
Collaborator Author

res-life commented Nov 2, 2021

@revans2 please review, no more changes.
CUDF fixed the blocking issue and the premerge build passed: "Can not read the generated ORC file by Spark CPU"

@revans2 revans2 merged commit 27d0e89 into NVIDIA:branch-21.12 Nov 2, 2021
@res-life res-life deleted the orc-write-map branch April 16, 2022 00:18
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support ORC write Map column(single level)
5 participants