[FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data #3026
Comments
I think we don't have to set this conf because spark-rapids customizes the entire ORC reading process. @wbo4958 has better knowledge on this topic.
The question is whether we already read only the necessary columns, or whether setting OrcConf.INCLUDE_COLUMNS would help us read less data.
Hi @tgravescs, with some investigation, I don't think In Spark, this conf affects the construction of In spark-rapids, there is a specialized helper function

/**
 * Compute an array of booleans, one for each column in the ORC file, indicating whether the
 * corresponding ORC column ID should be included in the file to be loaded by the GPU.
 *
 * @param evolution ORC schema evolution instance
 * @return per-column inclusion flags
 */
private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
  if (requestedMapping.isDefined) {
    // ORC schema has no column names, so need to filter based on index
    val orcSchema = orcReader.getSchema
    val topFields = orcSchema.getChildren
    val numFlattenedCols = orcSchema.getMaximumId
    val included = new Array[Boolean](numFlattenedCols + 1)
    util.Arrays.fill(included, false)
    // first column is the top-level schema struct, always add it
    included(0) = true
    // find each top-level column requested by top-level index and add it and all child columns
    requestedMapping.get.foreach { colIdx =>
      val field = topFields.get(colIdx)
      (field.getId to field.getMaximumId).foreach { i =>
        included(i) = true
      }
    }
    included
  } else {
    evolution.getFileIncluded
  }
}

The
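To make the index arithmetic in the helper above concrete, here is a minimal, self-contained sketch. It does not use the ORC library: `Node` is a hypothetical stand-in for ORC's `TypeDescription`, carrying only a pre-order column id and the largest id in its subtree, and `IncludedSketch`, `calcIncluded`, and `schema` are names invented for illustration. It only mirrors the branch taken when a requested mapping is defined.

```scala
// Hypothetical stand-in for ORC's TypeDescription: each node has a pre-order
// column id and the largest id found anywhere in its subtree.
final case class Node(id: Int, maxId: Int, children: List[Node] = Nil)

object IncludedSketch {
  // Mark the root plus every requested top-level column and all of its descendants.
  def calcIncluded(root: Node, requestedTopLevel: Seq[Int]): Array[Boolean] = {
    // A new Array[Boolean] is already filled with false.
    val included = new Array[Boolean](root.maxId + 1)
    // Id 0 is the top-level struct and is always included.
    included(0) = true
    requestedTopLevel.foreach { colIdx =>
      val field = root.children(colIdx)
      // Ids are assigned in pre-order, so a subtree is the contiguous range id..maxId.
      (field.id to field.maxId).foreach(i => included(i) = true)
    }
    included
  }

  // Example schema: struct<a:int, b:struct<x:int, y:int>, c:string>
  // Flattened ids: root=0, a=1, b=2, b.x=3, b.y=4, c=5
  val schema: Node = Node(0, 5, List(
    Node(1, 1),
    Node(2, 4, List(Node(3, 3), Node(4, 4))),
    Node(5, 5)))
}
```

Requesting only top-level column index 1 (`b`) with `IncludedSketch.calcIncluded(IncludedSketch.schema, Seq(1))` marks ids 0, 2, 3, and 4, and leaves `a` (id 1) and `c` (id 5) excluded.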
We are pruning columns manually, but we also call
Closes #3026 Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS. Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Is your feature request related to a problem? Please describe.
The PR below sets the list of read columns in the task configuration to reduce the amount of ORC data read. We need to check whether the same should be done in GpuOrcScan.scala.
PR: apache/spark@947c7ea27c
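As a rough illustration of what "setting the list of read columns in the task configuration" amounts to, the sketch below builds the comma-separated id list and stores it under the `orc.include.columns` key, which is the key behind `OrcConf.INCLUDE_COLUMNS`. This is a sketch under assumptions, not the code from the referenced PR: a plain immutable `Map` stands in for the Hadoop task configuration, the object and method names are hypothetical, and treating `-1` as "column not present in the file" is an assumed convention.

```scala
object IncludeColumnsSketch {
  // Key behind OrcConf.INCLUDE_COLUMNS in the ORC library.
  val IncludeColumnsKey = "orc.include.columns"

  // Build the comma-separated list of top-level column ids to read.
  // Assumption: -1 marks a requested column that is missing from the file.
  def includeColumnsValue(requestedColIds: Seq[Int]): String =
    requestedColIds.filter(_ != -1).sorted.mkString(",")

  // A plain Map stands in for the Hadoop task configuration (hypothetical helper).
  def withIncludedColumns(
      conf: Map[String, String],
      requestedColIds: Seq[Int]): Map[String, String] =
    conf + (IncludeColumnsKey -> includeColumnsValue(requestedColIds))
}
```

For example, `IncludeColumnsSketch.includeColumnsValue(Seq(2, -1, 0))` yields `"0,2"`. In real code the value would be written to the task's Hadoop configuration rather than a `Map`.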