
[FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data #3026

Closed
nartal1 opened this issue Jul 27, 2021 · 4 comments · Fixed by #4933
Assignees
Labels
audit_3.2.0 P1 Nice to have for release performance A performance related task/issue

Comments

nartal1 (Collaborator) commented Jul 27, 2021

Is your feature request related to a problem? Please describe.
The PR below sets the list of read columns in the task configuration to reduce the amount of ORC data read. We need to check whether the same should be done in GpuOrcScan.scala.

PR: apache/spark@947c7ea27c

@nartal1 nartal1 added feature request New feature or request ? - Needs Triage Need team to review and classify audit_3.2.0 labels Jul 27, 2021
@Salonijain27 Salonijain27 added P1 Nice to have for release performance A performance related task/issue and removed ? - Needs Triage Need team to review and classify feature request New feature or request labels Jul 27, 2021
sperlingxx (Collaborator) commented:
I don't think we have to set this conf, because spark-rapids customizes the entire ORC read path. @wbo4958 has better knowledge of this topic.

tgravescs (Collaborator) commented:
The question is: do we already read only the necessary columns, or would setting OrcConf.INCLUDE_COLUMN help us read less data?

@sperlingxx sperlingxx self-assigned this Mar 9, 2022
sperlingxx (Collaborator) commented Mar 9, 2022

Hi @tgravescs, after some investigation, I don't think OrcConf.INCLUDE_COLUMN can help us prune data.

In Spark, this conf affects the construction of SchemaEvolution by changing Reader.Options. Specifically, the effect of OrcConf.INCLUDE_COLUMN is reflected in SchemaEvolution.fileIncluded, which determines which columns of the file are read during the stripe parsing in RecordReaderImpl.
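As a rough sketch of the Spark side (hedged: `requestedIds` is an illustrative name, not the actual variable in the Spark patch), SPARK-35783 essentially writes a comma-separated list of column IDs into the task configuration, which ORC later converts into the fileIncluded flags:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.orc.OrcConf

// Illustrative sketch, not exact Spark code: requestedIds stands in for the
// column IDs the scan actually requests.
val requestedIds = Seq(0, 2, 5)
val conf = new Configuration()

// ORC reads this comma-separated list when constructing Reader.Options,
// and it ultimately surfaces as SchemaEvolution.fileIncluded.
OrcConf.INCLUDE_COLUMNS.setString(conf, requestedIds.mkString(","))
```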

In spark-rapids, there is a specialized helper function calcOrcFileIncluded that drops unnecessary ORC columns while building OrcOutputStripes:

    /**
     * Compute an array of booleans, one for each column in the ORC file, indicating whether the
     * corresponding ORC column ID should be included in the file to be loaded by the GPU.
     *
     * @param evolution ORC schema evolution instance
     * @return per-column inclusion flags
     */
    private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
      if (requestedMapping.isDefined) {
        // ORC schema has no column names, so need to filter based on index
        val orcSchema = orcReader.getSchema
        val topFields = orcSchema.getChildren
        val numFlattenedCols = orcSchema.getMaximumId
        val included = new Array[Boolean](numFlattenedCols + 1)
        util.Arrays.fill(included, false)
        // first column is the top-level schema struct, always add it
        included(0) = true
        // find each top-level column requested by top-level index and add it and all child columns
        requestedMapping.get.foreach { colIdx =>
          val field = topFields.get(colIdx)
          (field.getId to field.getMaximumId).foreach { i =>
            included(i) = true
          }
        }
        included
      } else {
        evolution.getFileIncluded
      }
    }

calcOrcFileIncluded does the same pruning work as the conf OrcConf.INCLUDE_COLUMN. In other words, after setting OrcConf.INCLUDE_COLUMN, we could replace this function with evolution.getFileIncluded.
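To make the flattened-ID bookkeeping concrete, here is a small self-contained sketch of the same flag computation. The toy schema and names (`IncludeFlagsSketch`, `includedFlags`, `topRanges`) are hypothetical, not spark-rapids code; a struct<a:int, b:struct<c:int, d:int>, e:int> flattens to IDs 0 (root), 1 (a), 2 (b), 3 (c), 4 (d), 5 (e):

```scala
object IncludeFlagsSketch {
  /** topRanges(i) = (minId, maxId) of top-level column i, mirroring the
    * field.getId to field.getMaximumId walk in calcOrcFileIncluded. */
  def includedFlags(maxId: Int,
                    topRanges: IndexedSeq[(Int, Int)],
                    requested: Seq[Int]): Array[Boolean] = {
    val included = Array.fill(maxId + 1)(false)
    included(0) = true // the root struct is always included
    requested.foreach { colIdx =>
      val (lo, hi) = topRanges(colIdx)
      (lo to hi).foreach(i => included(i) = true)
    }
    included
  }
}
```

For the toy schema above, requesting only top-level column 1 (the nested struct b) marks IDs 2 through 4 as included along with the root.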

Hi @wbo4958 @jlowe, please correct me if I am wrong :)

jlowe (Member) commented Mar 9, 2022

We are pruning columns manually, but we also call OrcInputFormat.buildOptions, which in turn examines the INCLUDE_COLUMN property to set up the reader options. That Reader.Options instance is then used in various places. I wouldn't be surprised if it does not matter in practice, but IMHO it would be better to set up the Reader.Options instance properly for the included columns, in case it matters for the ORC code we're calling, either today or in the future.
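A hedged sketch of that suggestion (assuming the OrcInputFormat.buildOptions overload taking a Configuration, a Reader, and split bounds; `buildPrunedOptions` is a hypothetical helper, not spark-rapids code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.orc.{OrcConf, Reader}
import org.apache.orc.mapred.OrcInputFormat

// Hypothetical helper: set INCLUDE_COLUMNS before building Reader.Options so
// the options carry the same inclusion information as the manual pruning in
// calcOrcFileIncluded.
def buildPrunedOptions(conf: Configuration, reader: Reader,
    includeIds: Seq[Int], start: Long, length: Long): Reader.Options = {
  OrcConf.INCLUDE_COLUMNS.setString(conf, includeIds.mkString(","))
  OrcInputFormat.buildOptions(conf, reader, start, length)
}
```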

sperlingxx added a commit that referenced this issue Mar 19, 2022
Closes #3026

Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>