Support orc coalescing reading #2909
Conversation
build
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
It would be good to update the description to say that the leak was fixed, and how you did it as well.
Did you run any tests on partitioned files? Also, did we compare to the CPU?
Made one pass through.
@@ -30,7 +30,9 @@ def read_orc_sql(data_path):
# test with original orc file reader, the multi-file parallel reader for cloud
original_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'PERFILE'}
multithreaded_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'MULTITHREADED'}
reader_opt_confs = [original_orc_file_reader_conf, multithreaded_orc_file_reader_conf]
coalescing_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}
# reader_opt_confs = [original_orc_file_reader_conf, multithreaded_orc_file_reader_conf, coalescing_orc_file_reader_conf]
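The commented-out list in the hunk above hints at the intended end state. A hedged sketch of that end state follows — variable names are taken from the test file above, but the final list contents in the merged PR may differ:

```python
# Reader-type configs, mirroring the integration-test snippet above.
original_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'PERFILE'}
multithreaded_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'MULTITHREADED'}
coalescing_orc_file_reader_conf = {'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}

# Parametrizing every ORC read test over all three reader types gives the
# new COALESCING reader the same coverage as the existing PERFILE and
# MULTITHREADED readers.
reader_opt_confs = [original_orc_file_reader_conf,
                    multithreaded_orc_file_reader_conf,
                    coalescing_orc_file_reader_conf]
```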
It looks like we need to add more tests — or were some failing for some reason?
Done.
Resolved review threads (outdated):
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala (7 threads)
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
 * @return Long, the estimated output size
 */
override def calculateEstimatedBlocksOutputSize(
    filesAndBlocks: LinkedHashMap[Path, ArrayBuffer[DataBlockBase]],
Maybe I'm missing it, but did we really need to change this interface to take the LinkedHashMap? It looks like you ignore the Path part of filesAndBlocks below in the foreach?
It is better to change it to the LinkedHashMap. When we estimate the initial file footer size, we can use the original file's footer size as the worst case for all stripes in that file. Otherwise, we would need to accumulate a footer size for every stripe, which over-estimates a bit.
Yes, I just ignore the Path, and I use stripes(0).ctx to estimate the footer size for the stripes in that Path.
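To make the worst-case footer estimate concrete, here is a hedged Python sketch of the idea — the dict field names `data_len` and `file_footer_len` are hypothetical stand-ins for the PR's actual Scala fields, and `stripes[0]` stands in for the `stripes(0).ctx` lookup mentioned above:

```python
from collections import OrderedDict

def estimate_output_size(files_and_blocks):
    """Worst-case output-size estimate for a coalesced ORC buffer.

    files_and_blocks maps each file path to the stripes selected from it,
    preserving insertion order like the Scala LinkedHashMap, so the incoming
    map needs no transformation.
    """
    total = 0
    for path, stripes in files_and_blocks.items():
        # Sum the stripe data itself.
        total += sum(s['data_len'] for s in stripes)
        # Count the original file's footer size once, as the worst case for
        # all stripes coming from this file, instead of accumulating one
        # footer estimate per stripe.
        total += stripes[0]['file_footer_len']
    return total
```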
I don't really understand this response. Why is it better to change to LinkedHashMap and pass parameters you aren't using? If the answer is that filesAndBlocks is already a LinkedHashMap, so we don't have to do any transformations, I'm fine with that — but you need to say that, and I would like to see it documented; otherwise someone is going to come along later and wonder about it.
Done, added some comments.
Done
Yes, I did two kinds of tests: the first is a performance test on non-partitioned files; the second is a comparison test between CPU and GPU on partitioned files. You can see them in the description. Is it necessary to run the performance test on partitioned files?
Yes, we should run something on partitioned files to make sure they are handled properly.
build
@tgravescs, I added the performance tests for partitioned and non-partitioned files and did the comparison tests for CPU vs. COALESCING on both. Please refer to the description.
Perhaps I'm missing it — do you have the perf results from reading some of these just on the CPU?
Resolved review thread (outdated):
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala
Overall very close; a few nits and questions.
build
Hi @tgravescs, I added the perf test for CPU on Standalone to the description; we see a 2x speedup on the CPU with COALESCING reading.
Overall it's fine. I would like to look into the footer estimate more, though, as it seems like it really overestimated.
Signed-off-by: Bobby Wang wbo4958@gmail.com
This PR adds support for ORC coalescing reading.
The ORC coalescing reading works much the same as the Parquet coalescing reading.
Below is the performance test result.
Performance on Standalone
1 CPU 12 cores, and 1 TITAN V (12G memory)
Non-partitioned 5000 orc files, total 1.3G in LOCAL storage
Partitioned 4092 orc files, total 1.3G in LOCAL storage
Performance on Non-partitioned ORC files in Databricks
Non-partitioned 5000 orc files, total 1.3G in LOCAL storage
Non-partitioned 5000 orc files, total 1.3G in DBFS storage
Non-partitioned 2659 orc files, total 6.9M (basically 1 row per orc file), in LOCAL storage
Non-partitioned 2659 orc files, total 6.9M (basically 1 row per orc file), in DBFS storage
Performance on Partitioned ORC files in Databricks
Partitioned 5797 orc files, total 1.3G in DBFS storage
Partitioned 5797 orc files, total 1.3G in LOCAL storage
Result comparison for CPU and COALESCING
The results collected and sorted locally on the driver side are the same for CPU and COALESCING reading for both (Partitioned 561 orc files, total 375M) and (Non-partitioned 300 ORC files, total 354M)
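The comparison methodology above can be sketched as follows. This is a simplified stand-in — the real tests collect Spark rows from CPU and GPU runs; the point here is only why a local sort on the driver is needed before comparing:

```python
def results_match(cpu_rows, gpu_rows):
    # The order in which files/stripes are read is not guaranteed, and the
    # COALESCING reader may interleave files differently from the CPU
    # reader, so both result sets are sorted locally on the driver before
    # the element-wise comparison.
    return sorted(cpu_rows) == sorted(gpu_rows)

# Same contents in a different order should still match ...
assert results_match([(1, 'a'), (2, 'b')], [(2, 'b'), (1, 'a')])
# ... while differing contents should not.
assert not results_match([(1, 'a')], [(1, 'x')])
```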
Re-design OrcPartitionReaderContext to fix the ORC reader leak issue
The previous OrcPartitionReaderContext held the ORC readers for the whole Spark task, which caused side effects such as keeping the S3 connection pool occupied, so that others trying to acquire a connection timed out. See issue #2850.
I previously had a PR that fixed this issue for PERFILE by closing ORC readers once an ORC file has finished reading; see the merged PR #2881.
But the #2881 fix no longer applies to COALESCING. COALESCING needs to filter out all the stripes from all ORC files beforehand, which means it creates an OrcPartitionReaderContext for every ORC file, and we can't close them in time because we will coalesce them later. So once the number of created OrcPartitionReaderContext instances exceeds the max S3 pool size, issue #2850 happens again.
The fix is to remove the ORC readers from OrcPartitionReaderContext, which now keeps only the necessary information.
When reading from ORC is actually needed, we create the ORC reader, read, and destroy it. At the same time, a thread pool controls how many tasks run simultaneously, so as long as the number of concurrent ORC reads stays below the S3 pool size, it works.
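A hedged sketch of this read-then-close pattern combined with a bounded thread pool — plain local files stand in for ORC-over-S3 readers, and the pool-size constant is an assumption for illustration, not a value from the PR:

```python
from concurrent.futures import ThreadPoolExecutor

S3_POOL_SIZE = 16  # assumed connection-pool limit, not taken from the PR

def read_stripe(task):
    """Open a reader, read one stripe's byte range, and close immediately,
    so no connection is held for the lifetime of the Spark task."""
    path, offset, length = task
    with open(path, 'rb') as reader:  # stand-in for an ORC/S3 reader
        reader.seek(offset)
        return reader.read(length)

def coalescing_read(tasks, max_workers=S3_POOL_SIZE - 1):
    # The pool caps the number of in-flight readers, keeping concurrent
    # ORC reads below the S3 connection-pool size.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_stripe, tasks))
```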