
[FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data #3026

Closed
nartal1 opened this issue Jul 27, 2021 · 4 comments · Fixed by #4933
Assignees
Labels
audit_3.2.0 P1 Nice to have for release performance A performance related task/issue

Comments

nartal1 (Collaborator) commented Jul 27, 2021

Is your feature request related to a problem? Please describe.
The PR below sets the list of read columns in the task configuration to reduce the amount of ORC data read. We need to check whether the same should be done in GpuOrcScan.scala.

PR: apache/spark@947c7ea27c

@nartal1 nartal1 added feature request New feature or request ? - Needs Triage Need team to review and classify audit_3.2.0 labels Jul 27, 2021
@Salonijain27 Salonijain27 added P1 Nice to have for release performance A performance related task/issue and removed ? - Needs Triage Need team to review and classify feature request New feature or request labels Jul 27, 2021
sperlingxx (Collaborator) commented:
I don't think we have to set this conf, because spark-rapids customizes the entire ORC read path. @wbo4958 has better knowledge of this topic.

tgravescs (Collaborator) commented:
The question is: do we already read only the necessary columns, or would setting OrcConf.INCLUDE_COLUMN help us read less data?

@sperlingxx sperlingxx self-assigned this Mar 9, 2022
sperlingxx (Collaborator) commented Mar 9, 2022

Hi @tgravescs, after some investigation, I don't think OrcConf.INCLUDE_COLUMN can help us prune data.

In Spark, this conf affects the construction of SchemaEvolution by changing Reader.Options. Specifically, the effect of OrcConf.INCLUDE_COLUMN is reflected in SchemaEvolution.fileIncluded, which determines which columns of the file are read during the stripe parsing in RecordReaderImpl.
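As a rough sketch of the Spark side (hedged: `requestedIds` is an illustrative name, not the actual variable in the Spark patch), SPARK-35783 essentially writes a comma-separated list of column IDs into the task configuration, which ORC later converts into the fileIncluded flags:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.orc.OrcConf

// Illustrative sketch, not exact Spark code: requestedIds stands in for the
// column IDs the scan actually requests.
val requestedIds = Seq(0, 2, 5)
val conf = new Configuration()

// ORC reads this comma-separated list when constructing Reader.Options,
// and it ultimately surfaces as SchemaEvolution.fileIncluded.
OrcConf.INCLUDE_COLUMNS.setString(conf, requestedIds.mkString(","))
```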

In spark-rapids, there is a specialized helper function calcOrcFileIncluded that drops unnecessary ORC columns while building OrcOutputStripes:

    /**
     * Compute an array of booleans, one for each column in the ORC file, indicating whether the
     * corresponding ORC column ID should be included in the file to be loaded by the GPU.
     *
     * @param evolution ORC schema evolution instance
     * @return per-column inclusion flags
     */
    private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
      if (requestedMapping.isDefined) {
        // ORC schema has no column names, so need to filter based on index
        val orcSchema = orcReader.getSchema
        val topFields = orcSchema.getChildren
        val numFlattenedCols = orcSchema.getMaximumId
        val included = new Array[Boolean](numFlattenedCols + 1)
        util.Arrays.fill(included, false)
        // first column is the top-level schema struct, always add it
        included(0) = true
        // find each top-level column requested by top-level index and add it and all child columns
        requestedMapping.get.foreach { colIdx =>
          val field = topFields.get(colIdx)
          (field.getId to field.getMaximumId).foreach { i =>
            included(i) = true
          }
        }
        included
      } else {
        evolution.getFileIncluded
      }
    }

calcOrcFileIncluded does the same pruning work as the conf OrcConf.INCLUDE_COLUMN. In other words, after setting OrcConf.INCLUDE_COLUMN, we could replace this function with evolution.getFileIncluded.
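To make the flattened-ID bookkeeping concrete, here is a small self-contained sketch of the same flag computation. The toy schema and names (`IncludeFlagsSketch`, `includedFlags`, `topRanges`) are hypothetical, not spark-rapids code; a struct<a:int, b:struct<c:int, d:int>, e:int> flattens to IDs 0 (root), 1 (a), 2 (b), 3 (c), 4 (d), 5 (e):

```scala
object IncludeFlagsSketch {
  /** topRanges(i) = (minId, maxId) of top-level column i, mirroring the
    * field.getId to field.getMaximumId walk in calcOrcFileIncluded. */
  def includedFlags(maxId: Int,
                    topRanges: IndexedSeq[(Int, Int)],
                    requested: Seq[Int]): Array[Boolean] = {
    val included = Array.fill(maxId + 1)(false)
    included(0) = true // the root struct is always included
    requested.foreach { colIdx =>
      val (lo, hi) = topRanges(colIdx)
      (lo to hi).foreach(i => included(i) = true)
    }
    included
  }
}
```

For the toy schema above, requesting only top-level column 1 (the nested struct b) marks IDs 2 through 4 as included along with the root.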

Hi @wbo4958 @jlowe, please correct me if I am wrong :)

jlowe (Member) commented Mar 9, 2022

We are pruning columns manually, but we also call OrcInputFormat.buildOptions, which in turn examines the INCLUDE_COLUMN property to set up the reader options. That Reader.Options instance is then used in various places. I wouldn't be surprised if it does not matter in practice, but IMHO it would be better to set up the Reader.Options instance properly for the included columns, in case it matters for the ORC code we're calling, either today or in the future.
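A hedged sketch of that suggestion (assuming the OrcInputFormat.buildOptions overload taking a Configuration, a Reader, and split bounds; `buildPrunedOptions` is a hypothetical helper, not spark-rapids code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.orc.{OrcConf, Reader}
import org.apache.orc.mapred.OrcInputFormat

// Hypothetical helper: set INCLUDE_COLUMNS before building Reader.Options so
// the options carry the same inclusion information as the manual pruning in
// calcOrcFileIncluded.
def buildPrunedOptions(conf: Configuration, reader: Reader,
    includeIds: Seq[Int], start: Long, length: Long): Reader.Options = {
  OrcConf.INCLUDE_COLUMNS.setString(conf, includeIds.mkString(","))
  OrcInputFormat.buildOptions(conf, reader, start, length)
}
```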

sperlingxx added a commit that referenced this issue Mar 19, 2022
Closes #3026

Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>