Sanitize column names in ParquetCachedBatchSerializer before writing to Parquet [databricks] #4258
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
build
Can you create or reference an issue that this PR solves?
// We want to change the original schema to have the new names as well
private def sanitizeColumnNames(originalSchema: Seq[Attribute],
    schemaToCopyNamesFrom: Seq[Attribute]): Seq[Attribute] = {
  originalSchema.zip(schemaToCopyNamesFrom).map(t => t._1.withName(t._2.name))
nit: prefer pattern-match to _1, _2, e.g.:
originalSchema.zip(schemaToCopyNamesFrom).map {
case (origAttr, newAttr) => origAttr.withName(newAttr.name)
}
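For illustration, here is a minimal, self-contained sketch of the renaming step in the pattern-match style suggested above. `Attr` is a hypothetical stand-in for Spark's `Attribute` so the snippet runs without Spark on the classpath, and the `require` length check is an assumption added for the sketch, not something the PR does. It is worth having because `zip` silently truncates to the shorter of the two sequences.

```scala
// Hypothetical stand-in for Spark's Attribute; only name and type matter here.
final case class Attr(name: String, dataType: String) {
  // Mirrors the shape of Attribute.withName: same attribute, new name.
  def withName(newName: String): Attr = copy(name = newName)
}

object SanitizeSketch {
  // Copy the names from schemaToCopyNamesFrom onto originalSchema, position by position.
  def sanitizeColumnNames(
      originalSchema: Seq[Attr],
      schemaToCopyNamesFrom: Seq[Attr]): Seq[Attr] = {
    // Assumption for this sketch: both schemas describe the same columns in the same order.
    require(originalSchema.length == schemaToCopyNamesFrom.length,
      "schemas must have the same number of columns")
    originalSchema.zip(schemaToCopyNamesFrom).map {
      case (origAttr, newAttr) => origAttr.withName(newAttr.name)
    }
  }
}
```

Only the names change; every other property of the original attributes is kept as-is.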
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
.../src/main/311+-all/scala/com/nvidia/spark/rapids/shims/v2/ParquetCachedBatchSerializer.scala
Generally 👍. Minor nitpick.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
There is a case where we use the original schema sent by Spark, which has names that haven't been
sanitized. This PR sets the sanitized names before creating the Hadoop conf.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>