-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle readBatch changes for Spark 3.3.0 #5425
Handle readBatch changes for Spark 3.3.0 #5425
Conversation
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
object ParquetVectorizedReader { | ||
private var readBatchMethod: Method = null | ||
def getReadBatchMethod(): Method = { | ||
if (readBatchMethod == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider lazy val
instead of manual memoization
readBatchMethod = | ||
classOf[VectorizedColumnReader].getDeclaredMethod("readBatch", Integer.TYPE, | ||
classOf[WritableColumnVector]) | ||
readBatchMethod.setAccessible(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider org.apache.commons.lang3.reflect.MethodUtils
for such tasks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage of using MethodUtils
?
It doesn't reduce the number of lines of code
val method = MethodUtils.getMatchingMethod(classOf[VectorizedColumnReader], "readBatch", Integer.TYPE, classOf[WritableColumnVector])
method.setAccessible(true)
Nor do we run in JRE prior to 1.4 where MethodUtils
has a work around.
IMO this would bring a 3rd party in to the picture when it's not really adding any value. Please feel free to elaborate on why you made the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal but there are invokeMethod flavors where you can just pass forceAccess = true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will look up the method every time as seen here. I don't think that's what we want
} | ||
} | ||
|
||
for (i <- missingColumns.indices) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we could save this extra loop if we moved L100-101 to after 94 missingColumns(i) = true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@gerashegalov do you have any more suggestions? |
build |
I am bit confused on the description of the PR. It's mentioned as it fixes this issue: #5257. Here we are throwing exception that |
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Great catch @nartal1 I typed the wrong error message. I have fixed it PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one nit. Not mandatory to address. LGTM.
...330+/scala/org/apache/spark/sql/execution/datasources/parquet/ShimCurrentBatchIterator.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build |
@nartal1 PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, provided we have tests exercising this logic for 3.3.0
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@nartal1 @gerashegalov |
build |
The existing class
CurrentBatchIterator
has been changed to a package-private abstract class. There are two implementations of this class, one for pre-spark-3.3.0 and the other for Spark3.3.0. Major difference between the two is the Spark-3.3.0 version usesParquetColumn
to read the parquet file while the other usesVectorizedColumnReaders
fixes #5257
fixes #5357
fixes #5429
Signed-off-by: Raza Jafri rjafri@nvidia.com