Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle readBatch changes for Spark 3.3.0 #5425

Merged
merged 9 commits into from
May 9, 2022

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented May 4, 2022

The existing class CurrentBatchIterator has been changed to a package-private abstract class. There are two implementations of this class, one for pre-spark-3.3.0 and the other for Spark3.3.0. Major difference between the two is the Spark-3.3.0 version uses ParquetColumn to read the parquet file while the other uses VectorizedColumnReaders

fixes #5257
fixes #5357
fixes #5429

Signed-off-by: Raza Jafri rjafri@nvidia.com

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri self-assigned this May 4, 2022
object ParquetVectorizedReader {
private var readBatchMethod: Method = null
def getReadBatchMethod(): Method = {
if (readBatchMethod == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider lazy val instead of manual memoization

readBatchMethod =
classOf[VectorizedColumnReader].getDeclaredMethod("readBatch", Integer.TYPE,
classOf[WritableColumnVector])
readBatchMethod.setAccessible(true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider org.apache.commons.lang3.reflect.MethodUtils for such tasks

Copy link
Collaborator Author

@razajafri razajafri May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of using MethodUtils?
It doesn't reduce the number of lines of code

val method = MethodUtils.getMatchingMethod(classOf[VectorizedColumnReader], "readBatch", Integer.TYPE,  classOf[WritableColumnVector])
method.setAccessible(true)

Nor do we run in JRE prior to 1.4 where MethodUtils has a work around.

IMO this would bring a 3rd party in to the picture when it's not really adding any value. Please feel free to elaborate on why you made the suggestion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal but there are invokeMethod flavors where you can just pass forceAccess = true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will look up the method every time as seen here. I don't think that's what we want

}
}

for (i <- missingColumns.indices) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we could save this extra loop if we moved L100-101 to after 94 missingColumns(i) = true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@sameerz sameerz added the audit_3.3.0 Audit related tasks for 3.3.0 label May 5, 2022
@razajafri
Copy link
Collaborator Author

@gerashegalov do you have any more suggestions?

@razajafri
Copy link
Collaborator Author

build

@nartal1
Copy link
Collaborator

nartal1 commented May 6, 2022

I am bit confused on the description of the PR. It's mentioned as it fixes this issue: #5257. Here we are throwing exception that Complex types are not supported for both Spark-3.3 and prior versions. There was some discussion on that issue to support for complex type. Could you please clarify if we should close the issue once this PR is merged OR keep it open for future work.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

I am bit confused on the description of the PR. It's mentioned as it fixes this issue: #5257. Here we are throwing exception that Complex types are not supported for both Spark-3.3 and prior versions. There was some discussion on that issue to support for complex type. Could you please clarify if we should close the issue once this PR is merged OR keep it open for future work.

Great catch @nartal1 I typed the wrong error message. I have fixed it PTAL

nartal1
nartal1 previously approved these changes May 6, 2022
Copy link
Collaborator

@nartal1 nartal1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit. Not mandatory to address. LGTM.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@nartal1 PTAL

nartal1
nartal1 previously approved these changes May 6, 2022
gerashegalov
gerashegalov previously approved these changes May 6, 2022
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, provided we have tests exercising this logic for 3.3.0

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri dismissed stale reviews from gerashegalov and nartal1 via a1a9bd9 May 6, 2022 22:30
@razajafri
Copy link
Collaborator Author

@nartal1 @gerashegalov
I reverted the test marked to xfail. PTAL

@nartal1
Copy link
Collaborator

nartal1 commented May 6, 2022

build

@sameerz sameerz added this to the May 2 - May 20 milestone May 9, 2022
@razajafri razajafri merged commit b7bb209 into NVIDIA:branch-22.06 May 9, 2022
@razajafri razajafri deleted the SP-5257-readBatch-changes branch May 9, 2022 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit_3.3.0 Audit related tasks for 3.3.0
Projects
None yet
4 participants