Handle readBatch changes for Spark 3.3.0 #5425

razajafri · 2022-05-04T23:31:11Z

The existing class CurrentBatchIterator has been changed to a package-private abstract class. There are two implementations of this class: one for Spark versions before 3.3.0 and one for Spark 3.3.0. The major difference between the two is that the Spark 3.3.0 version uses ParquetColumn to read the Parquet file, while the other uses VectorizedColumnReaders.
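A rough sketch of that structure, for readers skimming the description. Every name below (BatchIteratorSketch, PreSpark330Iterator, Spark330Iterator, readColumn) is a hypothetical stand-in, and the strings only mark which reading path each flavor would take; the real shims work against Spark's Parquet classes and are not reproduced here.

```scala
// Hypothetical stand-in for the shared abstract iterator; not the plugin's real class.
abstract class BatchIteratorSketch(numColumns: Int) extends Iterator[Seq[String]] {
  // Pretend the file holds two batches so the sketch terminates.
  private var batchesLeft = 2

  // Version-specific part: how one column of the current batch is read.
  protected def readColumn(index: Int): String

  override def hasNext: Boolean = batchesLeft > 0

  // Common part: iterate over the columns and delegate each read to the shim.
  override def next(): Seq[String] = {
    if (!hasNext) throw new NoSuchElementException("no more batches")
    batchesLeft -= 1
    (0 until numColumns).map(readColumn)
  }
}

// Pre-3.3.0 flavor: would read through a per-column VectorizedColumnReader.
class PreSpark330Iterator(numColumns: Int) extends BatchIteratorSketch(numColumns) {
  protected def readColumn(index: Int): String = s"VectorizedColumnReader:$index"
}

// 3.3.0 flavor: would read through ParquetColumn.
class Spark330Iterator(numColumns: Int) extends BatchIteratorSketch(numColumns) {
  protected def readColumn(index: Int): String = s"ParquetColumn:$index"
}
```

The point of the split is that only readColumn differs per Spark version, so the iteration logic lives once in the parent class.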

fixes #5257
fixes #5357
fixes #5429

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

gerashegalov · 2022-05-04T23:49:13Z

...ondb/scala/org/apache/spark/sql/execution/datasources/parquet/ShimCurrentBatchIterator.scala

+object ParquetVectorizedReader {
+  private var readBatchMethod: Method = null
+  def getReadBatchMethod(): Method = {
+    if (readBatchMethod == null) {


consider lazy val instead of manual memoization
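A minimal sketch of what the lazy val suggestion could look like. DemoReader is a hypothetical stand-in for VectorizedColumnReader so the example runs without Spark on the classpath, and the lookups counter exists only to show that the reflective lookup happens once.

```scala
import java.lang.reflect.Method

// Hypothetical stand-in for VectorizedColumnReader, with a private readBatch.
class DemoReader {
  private def readBatch(total: Int, sink: java.lang.StringBuilder): Unit =
    sink.append("x" * total)
}

object DemoReaderReflection {
  // Demo-only counter: how many times the reflective lookup actually ran.
  var lookups = 0

  // Initialized on first access, then reused; replaces the manual null check.
  lazy val readBatchMethod: Method = {
    lookups += 1
    val m = classOf[DemoReader].getDeclaredMethod(
      "readBatch", Integer.TYPE, classOf[java.lang.StringBuilder])
    m.setAccessible(true)
    m
  }
}
```

Callers would then use DemoReaderReflection.readBatchMethod.invoke(reader, Int.box(n), sink) directly. Scala also guarantees thread-safe initialization for lazy val, which the hand-written null check does not.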

gerashegalov · 2022-05-04T23:50:19Z

...ondb/scala/org/apache/spark/sql/execution/datasources/parquet/ShimCurrentBatchIterator.scala

+      readBatchMethod =
+        classOf[VectorizedColumnReader].getDeclaredMethod("readBatch", Integer.TYPE,
+          classOf[WritableColumnVector])
+      readBatchMethod.setAccessible(true)


Consider org.apache.commons.lang3.reflect.MethodUtils for such tasks

What is the advantage of using MethodUtils? It doesn't reduce the number of lines of code:

val method = MethodUtils.getMatchingMethod(classOf[VectorizedColumnReader], "readBatch", Integer.TYPE, classOf[WritableColumnVector])
method.setAccessible(true)

Nor do we run on a JRE prior to 1.4, where MethodUtils has a workaround.

IMO this would bring a third party into the picture when it's not really adding any value. Please feel free to elaborate on why you made the suggestion.

Not a big deal, but there are invokeMethod flavors where you can just pass forceAccess = true.

That will look up the method on every call. I don't think that's what we want.

gerashegalov · 2022-05-04T23:59:38Z

...ondb/scala/org/apache/spark/sql/execution/datasources/parquet/ShimCurrentBatchIterator.scala

+    }
+  }
+
+  for (i <- missingColumns.indices) {


Seems like we could save this extra loop if we moved L100-101 to after line 94 (missingColumns(i) = true).

You are right

razajafri · 2022-05-05T23:36:35Z

@gerashegalov do you have any more suggestions?

razajafri · 2022-05-05T23:36:43Z

build

nartal1 · 2022-05-06T18:55:41Z

I am a bit confused by the description of the PR. It says it fixes issue #5257, but here we throw an exception saying that complex types are not supported for both Spark 3.3 and prior versions. There was some discussion on that issue about supporting complex types. Could you please clarify whether we should close the issue once this PR is merged or keep it open for future work?

razajafri · 2022-05-06T19:31:19Z

Great catch, @nartal1. I typed the wrong error message. I have fixed it, PTAL.

nartal1

Just one nit. Not mandatory to address. LGTM.

...330+/scala/org/apache/spark/sql/execution/datasources/parquet/ShimCurrentBatchIterator.scala

razajafri · 2022-05-06T20:52:49Z

build

razajafri · 2022-05-06T20:59:17Z

@nartal1 PTAL

gerashegalov

LGTM, provided we have tests exercising this logic for 3.3.0

razajafri · 2022-05-06T22:32:54Z

@nartal1 @gerashegalov
I reverted the test that had been marked as xfail. PTAL.

nartal1 · 2022-05-06T22:54:58Z

build

razajafri added 4 commits May 4, 2022 12:29

Added a shim for CurrentBatchIterator

25d6e47

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Extracted super class for common code

a7aca66

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

removed the duplicate code from child class

fdaf82d

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

removed redundant code

dcdb595

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri self-assigned this May 4, 2022

gerashegalov reviewed May 5, 2022

addressed review comments

2883ce6

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

sameerz added the audit_3.3.0 Audit related tasks for 3.3.0 label May 5, 2022

addressed review comments

732341e

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

nartal1 previously approved these changes May 6, 2022

added new line

ca9f78b

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri dismissed nartal1’s stale review via ca9f78b May 6, 2022 20:52

nartal1 previously approved these changes May 6, 2022

gerashegalov previously approved these changes May 6, 2022

razajafri added 2 commits May 6, 2022 15:26

Merge remote-tracking branch 'origin/branch-22.06' into HEAD

df2b0f6

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

removed the xfailing test

a1a9bd9

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri dismissed stale reviews from gerashegalov and nartal1 via a1a9bd9 May 6, 2022 22:30

nartal1 approved these changes May 6, 2022

gerashegalov approved these changes May 6, 2022

sameerz added this to the May 2 - May 20 milestone May 9, 2022

razajafri merged commit b7bb209 into NVIDIA:branch-22.06 May 9, 2022

razajafri deleted the SP-5257-readBatch-changes branch May 9, 2022 16:51

GaryShen2008 mentioned this pull request May 10, 2022

[BUG] build failed on Databricks #5444

Closed

tgravescs mentioned this pull request May 27, 2022

[BUG]test_cache_expand_exec is failing on 330 #5673

Closed
