Sanitize column names in ParquetCachedBatchSerializer before writing to Parquet [databricks] #4258

razajafri · 2021-12-01T22:29:38Z

In one code path we use the original schema sent by Spark, whose column names haven't been sanitized.

This PR sets the sanitized names on that schema before creating the Hadoop conf.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri · 2021-12-01T22:29:56Z

build

razajafri · 2021-12-02T00:33:58Z

build

gerashegalov · 2021-12-02T08:24:56Z

Can you create or reference an issue that this PR solves?

gerashegalov · 2021-12-02T08:23:55Z

.../src/main/311+-all/scala/com/nvidia/spark/rapids/shims/v2/ParquetCachedBatchSerializer.scala

+  // We want to change the original schema to have the new names as well
+  private def sanitizeColumnNames(originalSchema: Seq[Attribute],
+      schemaToCopyNamesFrom: Seq[Attribute]): Seq[Attribute] = {
+    originalSchema.zip(schemaToCopyNamesFrom).map(t => t._1.withName(t._2.name))


nit: prefer pattern-match to _1, _2, e.g.:

originalSchema.zip(schemaToCopyNamesFrom).map { case (origAttr, newAttr) => origAttr.withName(newAttr.name) }
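For readers without the Spark sources at hand, the suggested pattern-match form can be tried in isolation. The following is a minimal sketch, not the plugin's code: `Attr` is a hypothetical stand-in for Spark's `Attribute` (which does provide `withName`), `SanitizeSketch` is an illustrative wrapper object, and the `require` length check is an added assumption, included because `zip` silently truncates to the shorter sequence.

```scala
// Hypothetical stand-in for Spark's Attribute; only the parts needed here.
final case class Attr(name: String, dataType: String) {
  // Mirrors Attribute.withName: returns a copy that differs only in its name.
  def withName(newName: String): Attr = copy(name = newName)
}

object SanitizeSketch {
  // Copy names positionally from schemaToCopyNamesFrom onto originalSchema.
  def sanitizeColumnNames(
      originalSchema: Seq[Attr],
      schemaToCopyNamesFrom: Seq[Attr]): Seq[Attr] = {
    // Assumption, not part of the PR: fail fast on a column-count mismatch,
    // since zip would otherwise drop the extra columns without warning.
    require(originalSchema.length == schemaToCopyNamesFrom.length,
      "schemas must have the same number of columns")
    originalSchema.zip(schemaToCopyNamesFrom).map {
      case (origAttr, newAttr) => origAttr.withName(newAttr.name)
    }
  }
}
```

Naming the tuple elements (`origAttr`, `newAttr`) makes it clear at the call site which side supplies the attribute and which side supplies the name, which `t._1` and `t._2` leave to the reader.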

razajafri · 2021-12-03T13:51:54Z

build

gerashegalov · 2021-12-03T18:40:56Z

#3879 is an epic. Which point of #3879 are you working on? It sounds like you are fixing a bug. Can you point to a GitHub bug issue with the problem description (repro and exception), or create one if there is none yet?

razajafri · 2021-12-06T05:13:20Z

> #3879 is an epic. Which point of #3879 are you working on? It sounds like you are fixing a bug. Can you point to a GitHub bug issue with the problem description (repro and exception), or create one if there is none yet?

Sorry, I pasted the wrong link. This fixes a bug

.../src/main/311+-all/scala/com/nvidia/spark/rapids/shims/v2/ParquetCachedBatchSerializer.scala

mythrocks

Generally 👍. Minor nitpick.

razajafri · 2021-12-06T21:03:04Z

build

Sanitize col names

c9229bd

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri self-assigned this Dec 1, 2021

razajafri changed the base branch from branch-22.02 to branch-21.12 December 2, 2021 00:33

gerashegalov reviewed Dec 2, 2021

sameerz added the bug label Dec 3, 2021

sameerz added this to the Nov 30 - Dec 10 milestone Dec 3, 2021

addressed review comments

0437c8f

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri requested a review from gerashegalov December 3, 2021 13:53

jlowe previously approved these changes Dec 6, 2021

mythrocks reviewed Dec 6, 2021

.../src/main/311+-all/scala/com/nvidia/spark/rapids/shims/v2/ParquetCachedBatchSerializer.scala

mythrocks requested changes Dec 6, 2021

addressed review comments

8a9a789

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

razajafri dismissed jlowe’s stale review via 8a9a789 December 6, 2021 20:58

mythrocks approved these changes Dec 6, 2021

sameerz merged commit d03e51d into NVIDIA:branch-21.12 Dec 7, 2021
